COMP572 Neural Networks

= Neural Networks Project 1 =

Date: 5/25/08

Student: Michael Weingarden

Course ID: COMP572 Neural Networks

Semester/Year: Summer 2008

Title/Topic: Project 1, KWTA, K-Winner, Winner Take All Neural Networks

 * [|Getting Started]
 * [|Does the energy decrease at each iteration?]
 * [|Does the network converge to a state that corresponds to "k-winners"?]
 * [|Try a case where the initial activations are all equal (non-zero)]
 * [|Show (mathematical proof) that the k-winner states are stable equilibrium when the external input satisfies: k-1 < ext < k]
 * [|Questions for further consideration]
 * [|References]

== Getting Started ==

To get started, I looked at Susan Zabaronick's work on the assignment (this can be found in my references below or via Dr. Wolfe's link page). She did a fantastic job. I have to admit that when I was looking at the required neural equations, I was not sure how to implement the algorithm. Her implementation makes it very easy to see how the K-winner (winner takes all) network works. I downloaded her source code and played with it a bit to get a feel for how the network reacts to various parameters.

Although I wasn't confident enough to try implementing the network algorithms from scratch, I did attempt to add some value to her work. First, I converted her matrix operations to JAMA (an open source matrix package for Java). Second, I added one other neat thing. I liked how she ran the network and then dumped the data to a text file. I also figured out how she produced all of her graphs using MS Excel, and I was able to duplicate her results. However, I felt it would be interesting to run the network and produce the graph on the fly, so I tacked on an open source chart package (JFreeChart). As a result, my version of Susan's application can produce neat KWTA graphs on the fly.

Unfortunately, I'm so slow that I wasn't able to add all the bells and whistles. My app does not allow you to change the parameters on the fly (you must do this through Eclipse), and I only set the app up to produce graphs of neurons versus activation values (not net inputs or energy levels). If you're interested, I've posted the source code plus external JAR library files, and a standalone JAR file that can be downloaded and run just by double clicking. Although you cannot change the parameters in the standalone JAR, the initial activations are random, so the program does produce a variety of pretty graphs.

 * [|Eclipse project archive file with external JAR files]
 * [|Standalone application (download and double click to run)]

If you don't have the patience to download and run these things, here is a snapshot of what the application can produce:

== 1. Does the energy decrease at each iteration? ==

The Energy function should decrease in value at each step of the iteration. If you observe that this is not true, then there is something wrong with your implementation. Plot the Energy as a function of t for cases: step1, step2, step3, step4 and ext1. That's 4 plots, but it would be nice to have all 4 on the same axes for comparison purposes.

It's late Wednesday night and I just realized that Susan does not have plots of the energy levels changing based on step (only changes in external input). So, I'll have to generate the graphs myself tomorrow. Please excuse me on that one, Dr. Wolfe. However, you can see from Susan's plots ([]) that energy levels do decrease as the external input is increased.

[|Back to Top]

== 2. Does the network converge to a state that corresponds to "k-winners"? ==

For example, in the case e = 1.5 there should be 2 "winners", and the 2 winners should correspond to the two neurons that had the highest initial activations. Demonstrate this by showing what the initial activations were, and what the final convergent state was for a single run (random initial activations) of each of the parameter cases. (Note: that's 12 cases).

The network definitely does converge to a state that corresponds to k-winners. When the external input (ext) is 0.5, you get one winner. For 1.5, you get two winners; for 2.5, you get three winners. In the image above, I set the external input to 2.5 and you can see the 3 winners (looks like Neurons 2, 3 and 15). Unfortunately, I forgot to display the external input value in the title bar of the frame; maybe tomorrow. The winners do correspond to the neurons with the highest initial activations.

This brings up a couple of questions in my mind:

1. What is the significance of the external input, and why does changing it by 1.0 add 1 new k-winner?

2. If the neurons with the highest activation levels always win, why bother with a neural network? Why not just do a bubble sort? (I hope this question doesn't make me look too oblivious.)

Also, a couple of neat observations. I tried running the network with many fewer neurons (like 5) and the external input still has the same effect on the number of winners. Susan mentioned that she required 5000 iterations for the network to reach complete convergence. In my experience, the number of iterations varied with the initial activation levels (which are random). The number of iterations required also seemed to be affected by the step and by the number of neurons.

[|Back to Top]

== 3. Try a case where the initial activations are all equal (non-zero) ==

a. Is there convergence to winners?

b. Does the Energy decrease at each iteration?

c. Demonstrate your responses by showing a specific simulation run: initial activations, activations at convergence, and an Energy plot.

According to Susan's plots, it's obvious that with all initial activations equal, there are no winners, and that seems to make sense. After all, if winners have to have the highest activation level, how can you have a winner when all neurons start at the same level?

This raised another question in my mind. Pattern recognition seems to be a common application for neural networks. Patterns are learned, and later new patterns are run through the NN in an attempt to recognize them. In the KWTA network, what is the pattern input? Is it the initial activation levels? Is it the external input? (That doesn't make much sense if it is the same for all neurons.) Is it the net input?

[|Back to Top]

== 4. Show (mathematical proof) that the k-winner states are stable equilibrium when the external input satisfies: k-1 < ext < k ==

Hint: K-winner states are the corners of the hypercube corresponding to k neurons at maximum activation and the rest at minimum activation. To show that these states are stable equilibria you must imagine the activation vector as being very close to, but not on, the hypercube corner. Then show that the dynamics of the network will "drive" the activation vector into the corner, as opposed to away from the corner. That is, if an activation is near the maximum, then the update will add a positive amount, and if an activation is near the minimum, the update will add a negative amount.

I'll be honest, this sort of rigorous math proof is just not my bag. I am anxious to see other students' approaches besides Susan Zabaronick's.

[|Back to Top]

== Questions ==

These are just questions that occurred to me as I experimented. Some of these are repeated or expanded upon above.

 * I tried generating data for many fewer neurons (only 5, step 0.1, ext 3.0) and I ended up with one neuron converging on an activation of approximately 0.8 (instead of 0 or 1). Why is that?
 * After looking at the data, it's obvious that the lowest initial activation level is going to lose (or the highest activation levels will win). So, why bother with the fancy neural network? Why not just do a bubble sort or a minmax function?
 * Neural nets are good at pattern recognition. What is the K-winner neural net used for? Are the initial activations supposed to be the input pattern? If so, is this project an exercise in training or recognition? If the initial activations are the input pattern, what are the net inputs?
 * I noticed that when you increase the external input by 1, you get an additional winner. Why is that? Is the ext input like a threshold, so that as you increase it you get more and more winners? Is it just a coincidence that the number of winners goes up by 1 every time you add 1 to the ext input? Why did I have an activation go to 0.8 when I tried an ext input of 3.0 (instead of 2.5 or 3.5)?

[|Back to Top]

== References ==

 * [|Susan Zabaronick (God bless this woman)]
 * [|More about energy]
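As a supplementary sketch for this project, the k-winner behavior discussed above can be reproduced in a few lines of Java. The required network equations are not reproduced on this page, so the dynamics here are an assumption: every pair of distinct neurons is connected with weight -1, the net input is net_i = ext - sum over j != i of a_j, activations are clipped to [0, 1], and the energy is E = 0.5 * sum over i != j of a_i a_j - ext * sum of a_i. The class and method names (`KwtaSketch`, `update`, `run`, `energy`) are hypothetical and are not taken from Susan Zabaronick's code.

```java
import java.util.Arrays;

public class KwtaSketch {

    /** Energy under the ASSUMED weights (-1 between distinct neurons). */
    static double energy(double[] a, double ext) {
        double sum = 0.0, sumSq = 0.0;
        for (double v : a) { sum += v; sumSq += v * v; }
        // 0.5 * sum_{i != j} a_i a_j  ==  0.5 * ((sum a)^2 - sum a^2)
        return 0.5 * (sum * sum - sumSq) - ext * sum;
    }

    /** One synchronous update: a_i <- clip(a_i + step * net_i). */
    static double[] update(double[] a, double ext, double step) {
        double total = 0.0;
        for (double v : a) total += v;
        double[] next = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double net = ext - (total - a[i]);        // inhibition from all other neurons
            next[i] = Math.max(0.0, Math.min(1.0, a[i] + step * net));
        }
        return next;
    }

    /** Iterates until nothing changes or maxIterations is reached. */
    static double[] run(double[] initial, double ext, double step, int maxIterations) {
        double[] a = initial.clone();
        for (int t = 0; t < maxIterations; t++) {
            double[] next = update(a, ext, step);
            if (Arrays.equals(next, a)) break;        // fixed point reached
            a = next;
        }
        return a;
    }

    public static void main(String[] args) {
        double[] start = {0.2, 0.9, 0.5, 0.7, 0.1};   // illustrative values, not project data
        for (double ext : new double[] {0.5, 1.5, 2.5}) {
            double[] end = run(start, ext, 0.01, 5000);
            System.out.println("ext=" + ext + " final=" + Arrays.toString(end)
                    + " energy " + energy(start, ext) + " -> " + energy(end, ext));
        }
    }
}
```

Under these assumed dynamics, a corner with k winners gives each winner a net input of ext - (k-1) and each loser a net input of ext - k, so the corner only holds when k-1 < ext < k. That would explain why raising ext by 1.0 admits one more winner. It may also be why ext = 3.0 left one neuron at an intermediate value: with exactly three winners at 1 and the other losers at 0, the remaining neuron's net input is 3 - 3 = 0, so nothing pushes it to either extreme.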

= Neural Networks Project 2 =

Date: 6/20/08

Student: Michael Weingarden

Course ID: COMP572 Neural Networks

Semester/Year: Summer 2008

Title/Topic: Project 2, The Traveling Salesperson Problem

 * [|Introduction to the Traveling Salesperson Problem]
 * [|The Code]
 * [|Results]
 * [|Convergence]
 * [|Questions for further consideration]
 * [|References]

== Introduction to the Traveling Salesperson Problem ==

code
A. An introduction that describes the TSP problem and the general idea for the neural network approach. Include observations you made after seeing the simulations, including any ideas you have for what made, or could make, the network better (get shorter tours more consistently in fewer iterations, etc.).
code

The Traveling Salesman Problem is one of several NP-complete problems that are very difficult to solve using a computer. This is because the number of possible tours grows factorially as cities are added. Even with enormous processing power, it would take a long time to find an optimal answer using brute force. So, I'm happy to explore the possibility of analyzing such a problem with an alternative strategy such as a neural network.

One glaring observation that I made is that it is quite easy to reach convergence using a circular arrangement of cities, but it is nearly impossible to achieve convergence with a random arrangement. This is especially interesting because, as a human, it is visually "obvious" when a particular arrangement looks like an optimal arrangement. Also, by comparing distances or incremental improvements in distance, it seems like it would be simple to get close to optimal results. So, I'm curious to find out how neural networks can be improved upon to get the desired results.

[|Back to Top]

== The Code ==

code
B. A copy of the code.
code

Below is a link to the code that was used for graphical and console based simulations. I am not claiming originality; I borrowed Susan Zabaronick's code. However, I did add some functionality to the console application. I wanted more progress information as the network was running, and I wanted more granular data regarding convergence. So, I added some code that displays more information while batch running the neural networks. You can see the results of this code in the [|Convergence] section.

[|Eclipse project archive file with external JAR files]

[|Back to Top]

== Results ==

code
C. Show the specific results for running the network twice for each case (case1, case2), with the following:

1. City positions.
2. Tour that resulted.
3. Drawing of the tour.
4. Length of the tour.
5. Final activations of the network.
6. Number of iterations for convergence.
7. A plot of the energy as a function of iteration.
code

Below, I have links to several of the trial runs generated from the simulation designed by Susan Zabaronick. The activation colors represent relative levels of activation. The white squares represent relatively lower activation levels and the dark blue squares represent relatively higher activation levels.

 * [|Case 1 (circle), Trial 1:]
 * [|Case 1 (circle), Trial 2:]
 * [|Case 2 (random), Trial 1:]
 * [|Case 2 (random), Trial 2:]

[|Back to Top]

== Convergence ==

code
D. Convergence Issues:

1. What "convergence criteria" did you use? How did your program know when to stop the simulation run?

2. It is assumed that you did more than a couple of simulation runs. Approximately what percent of the time did the network converge to an ambiguous state (no clear winner in each row and column)?
code

1. The program ran until the convergence threshold was reached for at least one neuron in every row or column, or until 10,000 iterations had been tried.

2. Below I have included some links to files created by the console app mentioned above. These files show all of the network runs for Case 1 (circle arrangement) with 0.6, 0.7 and 0.8 thresholds for neuron activations, and the same for Case 2 (random city arrangement).

If you look at the raw data in the files below, you'll see several columns separated by commas. The first column is just the progress markers that are output to the console so that there is some kind of feedback while each network is running. The other columns are iterations, run number, threshold, distance and converged. Iterations is the total number of network iterations to get to convergence. Run number is just an index. I threw in distance so that I could see what the distance was whether the network converged or not. Finally, I added converged as a 0 or 1 to make analyzing the data numerically easier.

 * [|Case 1, Threshold: 0.6]
 * [|Case 1, Threshold: 0.7]
 * [|Case 1, Threshold: 0.8]
 * [|Case 2, Threshold: 0.6]
 * [|Case 2, Threshold: 0.7]
 * [|Case 2, Threshold: 0.8]

Here's a summary of the raw data (100 runs of each, so numbers represent percent):

|| **Arrangement** || **Threshold** || **Converged (%)** || **Not Converged (%)** ||
|| Circle || 0.6 || 88 || 12 ||
|| Circle || 0.7 || 99 || 1 ||
|| Circle || 0.8 || 57 || 43 ||
|| Random || 0.6 || 0 || 100 ||
|| Random || 0.7 || 0 || 100 ||
|| Random || 0.8 || 0 || 100 ||

The above results show that it's a lot easier to make the circular arrangement converge than a random arrangement. It's also interesting that for the circular arrangement, you get better convergence results with a 0.7 threshold (99%) than you do with a 0.6 threshold (88%). And, since I included distance in the raw data files above, you can see that even though some networks did not converge for circles, all the runs ended with the optimal distance anyway.

[|Back to Top]

== References ==


 * [|Susan Zabaronick]
 * [|Fuzzy Hopfield-Tank Traveling Salesman Problem]
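As a supplementary sketch for this project, the convergence test described in the Convergence section can be written as a small Java routine. This is not the code from the borrowed application. It is a slightly stricter variant that I am assuming for illustration: a state counts as converged only when every row (city) and every column (tour position) has exactly one activation at or above the threshold, which is what a valid tour requires. The names `TspConvergenceSketch`, `decodeTour` and `tourLength` are hypothetical.

```java
import java.util.Arrays;

public class TspConvergenceSketch {

    /**
     * Decodes a tour from an n x n activation matrix (row = city, column = position).
     * Returns cityAtPosition, or null if the state is ambiguous, i.e. some row or
     * column does not have exactly one activation >= threshold.
     */
    static int[] decodeTour(double[][] act, double threshold) {
        int n = act.length;
        int[] cityAtPosition = new int[n];
        Arrays.fill(cityAtPosition, -1);
        for (int city = 0; city < n; city++) {
            int winners = 0;
            for (int pos = 0; pos < n; pos++) {
                if (act[city][pos] >= threshold) {
                    winners++;
                    if (cityAtPosition[pos] != -1) return null; // two cities claim one position
                    cityAtPosition[pos] = city;
                }
            }
            if (winners != 1) return null;                      // no winner or several in this row
        }
        // n rows with one winner each in n distinct columns: every column is covered
        return cityAtPosition;
    }

    /** Length of the closed tour through the given city coordinates. */
    static double tourLength(int[] tour, double[][] xy) {
        double length = 0.0;
        for (int i = 0; i < tour.length; i++) {
            double[] p = xy[tour[i]];
            double[] q = xy[tour[(i + 1) % tour.length]];
            length += Math.hypot(p[0] - q[0], p[1] - q[1]);
        }
        return length;
    }
}
```

A batch runner could call `decodeTour` after each iteration and stop when it returns a tour or when the 10,000 iteration limit is hit. Recording `tourLength` in both cases would reproduce the distance and converged columns of the raw data files.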

= Neural Networks Project 3 =

Date: 7/9/08

Student: Michael Weingarden

Course ID: COMP572 Neural Networks

Semester/Year: Summer 2008

Title/Topic: Project 3, TSP Genetic Algorithm

 * [|Introduction]
 * [|The Code]
 * [|The Applet]
 * [|Results]
 * [|Conclusions]
 * [|Questions for Further Consideration]
 * [|References]

== Introduction ==

I really enjoyed this project. After spending so many hours staring at the requirements and fiddling with the algorithms, I now feel I have a better understanding of genetic algorithms. The best part is that they are no longer the black magic that I thought they were. Even the concept of neural networks seemed a bit like black magic. For example, where did the weights come from and how did the neural algorithms come about? With this project, though, I came to the realization that "genetic" just means what happens when you take two starting points (parents) and infer a result (a child). After spending a lot of time looking at the distance algorithm, I realized that it was just one idea for how parents might be combined to produce a child. And it's pretty cool to be able to implement a program that lets you try mass mating and see what the results are.

Another thing that became clear from this assignment is that problems that are very difficult to solve can be stated very simply. I was vaguely aware of this from past experiences with math and computers, but this was a refresher course.

After looking at prior student work on this assignment, I decided to focus my efforts on the distance recombination method. Prior students seemed to focus their efforts on the order recombination method, but Dr. Wolfe asserted that distance recombination would provide superior results for the Traveling Salesman Problem. This piqued my curiosity. The first thing that I discovered is that distance recombination is more popularly referred to as edge recombination. I was led to this realization by former student Mark Labovitz (see references below). There is a link to a nice paper (see references below) on edge recombination in the Links section of Dr. Wolfe's web site. I also found a simple article on edge recombination at Wikipedia (again, see references).
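To give a feel for the edge recombination operator mentioned above, here is a compact Java sketch in the spirit of the Wikipedia description: build a neighbor list for every city from both parents, then repeatedly move to the unvisited neighbor that itself has the fewest remaining neighbors, breaking ties at random and falling back to a random unvisited city when the current city has no neighbors left. It is an illustration written for this page, not the project's code; the names `EdgeRecombinationSketch`, `buildEdgeMap` and `createChild` are hypothetical, and tours are assumed to be closed loops stored as arrays of city indices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class EdgeRecombinationSketch {

    /** For each city, the union of its neighbors in both (cyclic) parent tours. */
    static Map<Integer, Set<Integer>> buildEdgeMap(int[] p1, int[] p2) {
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (int[] parent : new int[][] {p1, p2}) {
            int n = parent.length;
            for (int i = 0; i < n; i++) {
                Set<Integer> s = edges.computeIfAbsent(parent[i], k -> new TreeSet<>());
                s.add(parent[(i + 1) % n]);       // successor in this parent
                s.add(parent[(i + n - 1) % n]);   // predecessor in this parent
            }
        }
        return edges;
    }

    /** Builds one child tour from two parent tours over the same cities. */
    static int[] createChild(int[] p1, int[] p2, Random rng) {
        Map<Integer, Set<Integer>> edges = buildEdgeMap(p1, p2);
        int n = p1.length;
        int[] child = new int[n];
        List<Integer> unvisited = new ArrayList<>();
        for (int city : p1) unvisited.add(city);

        int current = rng.nextBoolean() ? p1[0] : p2[0];  // first city of a random parent
        for (int i = 0; i < n; i++) {
            child[i] = current;
            unvisited.remove(Integer.valueOf(current));
            for (Set<Integer> s : edges.values()) s.remove(current); // drop visited city everywhere
            if (i == n - 1) break;

            // Among the current city's remaining neighbors, keep those with the fewest neighbors.
            List<Integer> best = new ArrayList<>();
            int fewest = Integer.MAX_VALUE;
            for (int candidate : edges.get(current)) {
                int size = edges.get(candidate).size();
                if (size < fewest) { fewest = size; best.clear(); }
                if (size == fewest) best.add(candidate);
            }
            // No neighbors left: fall back to any unvisited city.
            List<Integer> pool = best.isEmpty() ? unvisited : best;
            current = pool.get(rng.nextInt(pool.size()));
        }
        return child;
    }
}
```

When both parents are the same circular tour, every step has an inherited edge available, so the child is that same tour traversed in one direction or the other. With two different parents, the child keeps parent edges where it can and only introduces a new edge when it is forced to jump.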
Since the paper and Wikipedia both described a similar algorithm, I chose to go for it: implement the edge recombination algorithm and then compare the results with prior students' work on the order recombination algorithm. I was enticed by how "simply stated" the algorithm was at Wikipedia. Of course, that was one book that had a lot of surprises under the cover. I could not believe how much code I had to write to implement such a simply stated algorithm. Just to compare side by side, I've included a link to the Wikipedia algorithm and links to the code I wrote here:

 * [|Wikipedia algorithm for edge recombination] (12 lines)
 * [|Neighbors class] (300 lines)
 * [|Tour class] (60 lines dedicated to edge recombination)

Note: The Tour class was mainly written by Susan Zabaronick. However, I did write the createChildDistance method to compare TSP results with the createChildOrder method she wrote. The Neighbors class I wrote entirely myself, which is no bragging matter considering that only a lame brain would choose Neighbor as part of variable, method and property names!

Secondary note: if you're curious about the color coding of the code, you should go to my Bunny Trail link below.

Notice the difference between the stated algorithm and the amount of code required? I'm sure I'm not the most efficient coder, and I hate those of you that implement this stuff in 5 lines using MATLAB. However, I do feel that I tried to trim back on inefficiencies wherever I could. Even so, I did not expect over 300 lines of code!

Since I leveraged Zabaronick's code, all I had to do was write my edge recombination algorithm as a sort of plugin to her GUI and console apps. This was nice because, after all my struggling, it was very easy to visualize the fruits of my labor. After experimenting, I do believe Dr. Wolfe's assertion that edge recombination produces superior results to order recombination (at least for TSP). I did find cases where order was as good as or superior to edge, but edge was far better in some cases.

[|Back to Top]

== The Code ==

Here is a link to my Eclipse archive file for the project. For those of you that are new to Eclipse (like me), you can download the code, use File -> Import and choose to import the archive (zip) file directly as an Eclipse project. It's pretty cool. Much of the code is from Susan Zabaronick's prior project. I added the distance (edge) recombination algorithm and the hooks to make it work from the GUI and console apps.

[|Eclipse project archive file gatspMW1.zip]

If you'd like to try running the application from the privacy of your own home, you can download this JAR file and run the program anytime:

[|Genetic Algorithm TSP application]

Or, you can scroll down a little bit and try running the program as an applet from this web site.

I'm still very new to object oriented programming (I started programming before PCs were invented), but in this project I learned a lot about OOP and also about how to use Eclipse to make coding easier. Ultimately, I came to this conclusion: I love Eclipse for Java! I ran into many problems trying to stay organized while flipping back and forth between various classes and their methods. So, if you're interested, I'm creating a little bunny trail explaining my adventures and the tips and tricks I learned along the way:

[|Genetic Algorithm TSP Bunny Trail]

[|Back to Top]

== The Applet ==

[|The Applet]

[|Back to Top]

== Results ==

code
Case 1: Order Method
a. total generations (iterations)
b. drawing of the best tour found
c. plot of the length of the best and worst tour at each generation (gives the range of fitness values for each generation).
d. what do you think is the "best" mutation rate for this problem? Explain.
code

I'm including links to several "pretty pictures" that provide a, b and c from above. I'll start with just the pictures for the 0.1 mutation rate. But before you look, here's a quick summary of what to look for.

The algorithm starts by generating 20 random tours (with cities either in a circle or in random positions). The tours are sorted from best fitness (tours[0]) to least fitness (tours[19]). Fitness is judged by tour distance; least is best. Then a child is formed by order recombination with swap mutation. The child is tacked on as the last tour (number 20) to replace the least fit tour. Then all the tours are again sorted from best to least fitness. This process is repeated as new children are spawned.

When you look at the graphs, you will see the best and worst tour (out of the population of 20 tours). These values are plotted every 100 network runs (iterations). The tour drawing that you see is the final and best tour produced. If you look in the text boxes below, you will see the number of iterations. With the circle tours, I stopped the algorithm early so that you can see nearly exactly at which iteration the tour became optimized. With the random tours, I let the algorithm run until 10,000 iterations so you can see whether the best tour improved much after many iterations.

 * [|Circle first try]
 * [|Circle second try]
 * [|Random first try]
 * [|Random second try]

To answer part d from above, I ran the network 100 times using the console application. I used a circle arrangement of cities so it would be obvious whether a tour was optimal or not. Here are the results for several mutation rates (optimal tours out of 100 runs):

 * [|Mutation rate: 0.00] (0 optimal tours)
 * [|Mutation rate: 0.02] (82 optimal tours)
 * [|Mutation rate: 0.04] (96 optimal tours)
 * [|Mutation rate: 0.06] (99 optimal tours)
 * [|Mutation rate: 0.08] (100 optimal tours)
 * [|Mutation rate: 0.10] (94 optimal tours)
 * [|Mutation rate: 0.12] (72 optimal tours)

code
Case 2: Distance Method
a. total generations (iterations)
b. drawing of the best tour found
c. plot of the length of the best and worst tour at each generation (gives the range of fitness values for each generation).
d. what do you think is the "best" mutation rate for this problem? Explain.
code

This is the edge recombination algorithm that I implemented to compare the results to the order method above. Again, there are several pictures which explain a, b and c above. These images are only for a mutation rate of 0.1:

 * [|Circle first try]
 * [|Circle second try]
 * [|Random first try]
 * [|Random second try]

With the random tours, it's difficult to do an apples to apples comparison with the order method, because the tours aren't the same and that may affect how hard they are to optimize. However, it's interesting to notice the difference in the circle tours. Again, I stopped the run when optimization was visually obvious. You can see that the circular tour required many fewer iterations with edge recombination than with order recombination.

For part d above, I ran the network 100 times for each of the following mutation rates:

 * [|Mutation rate: 0.00] (0 optimal tours)
 * [|Mutation rate: 0.02] (93 optimal tours)
 * [|Mutation rate: 0.04] (100 optimal tours)
 * [|Mutation rate: 0.06] (100 optimal tours)
 * [|Mutation rate: 0.08] (100 optimal tours)
 * [|Mutation rate: 0.10] (95 optimal tours)
 * [|Mutation rate: 0.12] (65 optimal tours)

[|Back to Top]

== Conclusions ==

It looks like the best mutation rate for the order recombination method was about 0.08. For the distance (edge) recombination, it looks like 0.06. For both types of recombination, you can't have too much mutation or too little. One last thing to note: the distance recombination method seemed to perform better even when the mutation rate was off a little.

[|Back to Top]

== Questions for Further Consideration ==

 * Does UML really help? Being new to OOP, I found it quite confusing to figure out where I was at times. I was looking for a way to organize myself better and UML looked like one alternative. In the old days, we used flowcharts, and UML looked like it might be like a flowchart. Unfortunately, I didn't find much consolation in UML. It's not really like flowcharts except where inheritance is concerned. So, how do programmers organize their thoughts these days? I also tried mind mapping software called Freemind, which is good for organizing concepts but not great for flow. Any thoughts would be appreciated.

[|Back to Top]

== References ==


 * [|Susan Zabaronick]
 * [|Extensive paper on edge (distance method) recombination (PDF file)]
 * [|Simpler description of Edge Recombination]
 * [|Mark Labovitz]

= Neural Networks Project 4 =

Date: 7/15/08

Student: Michael Weingarden

Course ID: COMP572 Neural Networks

Semester/Year: Summer 2008

Title/Topic: Project 4, Backpropagation

**Output Neuron Activation Surface Plot**

 * [|Introduction]
 * [|A. Copy of the Code]
 * [|B. Describe how the training set was chosen and how the training was done]
 * [|C. Value of "step size" (called "gain" in Lippmann)]
 * [|D. Value of "momentum" term]
 * [|E. Explain how you updated the weights]
 * [|F. How many points were required for network to learn the difference between types A and B?]
 * [|G. After network was trained, what % of the time did it classify correctly?]
 * [|H. What decision criteria was used? How did you decide the network was classifying the input as A or B?]
 * [|References]

== Introduction ==

This assignment involved the construction of a neural network that uses backpropagation as a training mechanism. The concept of backpropagation is very simply described in the class notes on p.482 (right after the problem description on p.481). You have training inputs with known desired outputs. The inputs are fed forward through the network, and at the end the output is checked against the desired output. If there are errors, they are propagated back through the network and the weights between neurons are readjusted. This is repeated until few errors result when working with the training data.

[|Back to Top]

== A. Copy of the Code ==

As usual, I depended heavily on prior work by Susan Zabaronick, but this time I also benefited from the code of William Clements, who did some neat things that I wanted to incorporate. All of the code I used can be downloaded as an archived Eclipse project. If you want to try it out for yourself, just download the zip, create a project in Eclipse and then use Import... -> General -> Archive File. Here's the zipped project file:

[|Eclipse project archive file backp3.zip]

If you'd like to try running the application from the privacy of your own home, you can download this JAR file and run the program anytime. Here is the JAR file:

[|Backpropagation application]

**Warning:** I do recommend putting it in a folder before running it. When you choose to produce surface plots, many files will be created in the folder where you run it from.

After looking through the class notes and prior student work, I became obsessed with the surface plots of output neuron activation levels. The color pictures on the web sites were beautiful and this piqued my interest. As usual, I gravitated towards Susan Zabaronick's work because I like her GUI implementations. Unfortunately, Susan did not include a mechanism to produce the data required for the surface plots. Fortunately, another student, William Clements, did. After looking at both students' implementations, I realized how beneficial both efforts were in visualizing how backpropagation works. So, I set about the task of implementing the surface plot function in Zabaronick's GUI app. Of course, this was not a trivial undertaking. The two coding styles were very different. Ultimately, the effort led to a much better understanding of how backpropagation works.

William Clements's work looked like it was based on the work of Michael Parry in the class notes. In particular, I was wondering how he was able to produce the data for the surface plots. This led to the question: do the surface plots evolve during training or testing? The way Michael Parry implemented the code, the surface plots evolved from a hybrid of the two. While the system was training, Parry ran some code that tested the current state of the network. He called his test a "probe." The thing that baffled me for days before I scrutinized Parry's code was: how do you generate surface plots from random test points? Parry helped me to realize that you can't.

So, in my implementation of the code, I had two types of tests. One test runs 100 random points through the net after training is finished. The other runs 1600 evenly spaced coordinates through the net every so many epochs during training. With the 100 random points, I could easily calculate the percent correct. With the 1600 evenly spaced points, I could produce surface plots throughout training to visualize the evolving pattern of output activations. Below, I have links to the code required to implement the training and the tests:

 * [|Test of 100 random points]
 * [|Test of 1600 points spread evenly throughout square zone]
 * [|Training algorithm]
 * [|Bridge code to marry my test algorithm with Zabaronick's training algorithm]

Also, just for fun, here is a page with a set of pretty surface plots and the associated network parameters and results: [|Backpropagation Surface Plots]

== B. Describe how the training set was chosen and how the training was done ==

The training set consisted of 1000 randomly located points within the 4 x 4 square. The training was done in an iterative fashion using the Lippmann algorithm as explained in the class notes (p.455 to p.459) and in the Lippmann IEEE PDF article (see references below). The Wikipedia entry for Backpropagation explains the process quite simply:

Summary of the technique:

code
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output from that sample. Calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a scaling factor, how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
6. Repeat the steps above on the neurons at the previous level, using each one's "blame" as its error.
code

Here is an interesting note I just stumbled upon. In the reference below (Momentum explained), I discovered this statement: "During a training epoch with revision after a particular example, the examples can be presented in the same sequential order, or the examples could be presented in a different random order for each epoch. The random representation usually yields better results." The method used for training here presented the examples in the same sequential order every epoch. It might be interesting to revamp the training algorithm to present the examples in a random order for each epoch.

Here you can see how the output activation levels evolve over time during training: [|Backpropagation Surface Plots].

[|Back to Top]

== C. Value of "step size" (called "gain" in Lippmann) ==
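To make the six steps quoted in section B concrete, and to show where the step size (gain) and momentum enter, here is a minimal Java sketch of one online backpropagation update for a network with one hidden layer and a single sigmoid output. It follows the standard form of the rule, new change = gain * delta * input + momentum * previous change. It is not the project's training code: the class name `BackpropSketch`, the layer sizes and the weight initialization range are all assumptions made for illustration.

```java
import java.util.Random;

public class BackpropSketch {
    final int nIn, nHid;
    final double[][] wHid;      // [nHid][nIn + 1]; last column is the bias weight
    final double[] wOut;        // [nHid + 1]; last entry is the bias weight
    final double[][] dHidPrev;  // previous weight changes, kept for the momentum term
    final double[] dOutPrev;
    final double[] hidden;      // hidden activations from the latest forward pass

    BackpropSketch(int nIn, int nHid, long seed) {
        this.nIn = nIn;
        this.nHid = nHid;
        wHid = new double[nHid][nIn + 1];
        wOut = new double[nHid + 1];
        dHidPrev = new double[nHid][nIn + 1];
        dOutPrev = new double[nHid + 1];
        hidden = new double[nHid];
        Random rng = new Random(seed);
        for (double[] row : wHid)
            for (int i = 0; i < row.length; i++) row[i] = rng.nextDouble() - 0.5;
        for (int j = 0; j < wOut.length; j++) wOut[j] = rng.nextDouble() - 0.5;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /** Step 1: feed the sample forward and return the output activation. */
    double forward(double[] x) {
        for (int j = 0; j < nHid; j++) {
            double s = wHid[j][nIn];
            for (int i = 0; i < nIn; i++) s += wHid[j][i] * x[i];
            hidden[j] = sigmoid(s);
        }
        double s = wOut[nHid];
        for (int j = 0; j < nHid; j++) s += wOut[j] * hidden[j];
        return sigmoid(s);
    }

    /** Steps 2 to 6 for one sample; returns the squared error before the update. */
    double train(double[] x, double target, double gain, double momentum) {
        double y = forward(x);
        double deltaOut = y * (1 - y) * (target - y);           // local error at the output
        double[] deltaHid = new double[nHid];                   // "blame" passed back, using old weights
        for (int j = 0; j < nHid; j++)
            deltaHid[j] = hidden[j] * (1 - hidden[j]) * deltaOut * wOut[j];
        for (int j = 0; j <= nHid; j++) {
            double in = (j < nHid) ? hidden[j] : 1.0;
            double change = gain * deltaOut * in + momentum * dOutPrev[j];
            wOut[j] += change;
            dOutPrev[j] = change;
        }
        for (int j = 0; j < nHid; j++) {
            for (int i = 0; i <= nIn; i++) {
                double in = (i < nIn) ? x[i] : 1.0;
                double change = gain * deltaHid[j] * in + momentum * dHidPrev[j][i];
                wHid[j][i] += change;
                dHidPrev[j][i] = change;
            }
        }
        return (target - y) * (target - y);
    }
}
```

Calling `train` once per training point, in order, gives the sequential presentation used in this project. Shuffling the points before each epoch would give the random presentation suggested in the quoted note. A gain of 0 leaves only the momentum term, so the weights never move from their initial values.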

After reading prior students' answers to this question, I began to wonder whether I had interpreted the question correctly. So, I will answer it in two ways. The first way is to address the **benefit** of "step size" (or gain). The second way is to address "step size" numerically, as prior students have.

To see the effects of varying initial weights, step size (gain) and momentum, you can look at an excellent table produced by Susan Zabaronick: []. Using this table, Zabaronick was able to figure out the "best" values to produce networks that converged.

One interesting note here with respect to convergence: there is no absolute indicator. In the GUI application, you can set a Decision Threshold and a Training Threshold. The Decision Threshold is used throughout the training process to determine whether individual neuron weights need to be adjusted. The Training Threshold is used to determine whether the network has converged. This means that convergence is relative to the Training Threshold value. Ultimately, you want to set a convergence level that produces a high level of correctness when the network is tested. However, again, a high level of correctness is relative. Is 90% high? Is 99% high? For that matter, is 70% high?

So what impact does gain have on the network?

With respect to speed and convergence? I wasn't entirely sure how to compare the system with and without gain. I made the assumption that "no gain" meant a gain of 1. With that in mind, it seems like gain is more of an attenuation than a gain. So, I tried the network with a gain of 1, 0.1, 0.5 and 0.9. After seeing what these do, I decided to try 0 (zero) as well (just in case that's really what "no gain" means). I have my results in pictures which you can look at below. The gain of 1 produced a sum squared error function which looked like it was never going to converge (or would at least take a long time). The gain of 0.1 gave almost instant convergence. The gain of 0.5 was less stable and did not converge on this run.

With respect to stability?

With respect to correctness? One interesting thing to note: even though only one setting achieved convergence, all settings produced correctness results of 90% or better! So, what's more important, convergence or correctness, and what level of correctness is good enough?

 * [|Error function for zero gain] (no backpropagation)
 * [|Error function for 0.1 gain] (best gain setting for speed and convergence)
 * [|Error function for 0.5 gain] (less stable, no convergence on this run)
 * [|Error function for 0.9 gain] (less stable, no convergence, higher sum squared error)
 * [|Error function for 1.0 gain] (unstable, no convergence, high sum squared error)

[|Back to Top]

== D. Value of "momentum" term ==

As with "step size" above, I will address this question in two ways. First, the **benefit** of "momentum" and, second, what **values** were used. To me, as an engineer, the resulting benefit of using momentum is the most interesting part.

In the class notes, I only found one reference to the momentum term (p.459, Summary using Lippmann's Notation). On that page, it simply says that the momentum term provides a smoothing effect. Of course, in running the network dozens of times, I noticed that the momentum seemed to make the sum squared error smaller in fewer epochs. So, I searched the Internet for an opinion to confirm or dispel this observation. I found an excellent page that describes the momentum term in great detail (see references below). At that web site, momentum is described in the following manner: "empirical evidence shows that the use of a term called momentum in the backpropagation algorithm can be helpful in speeding the convergence and avoiding local minima." Avoiding local minima seems to relate to the smoothing effect described in the class notes.

On to the empirical results. A momentum of 0.1 did seem to cause some nice smoothing of the sum squared error function, but 0.5 and 0.9 seemed to cause instability and oscillation. See for yourself:

 * [|Error function for 0.1 momentum] (best momentum setting for stability)
 * [|Error function for 0.5 momentum] (instability and no convergence)
 * [|Error function for 0.9 momentum] (instability, surprisingly fast convergence, but lower correctness than 0.1 during testing)

[|Back to Top]

E. Explain how you updated the weights

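As a rough illustration of the weight update explained in this section, here is a minimal sketch of one update step that adds a gain term and a momentum term. It follows the common textbook form (gain times the node's error term times its input, plus momentum times the previous change). The class name, method name and parameters are hypothetical and this is not the code from the GUI application.

```java
// Hypothetical sketch of one weight update with gain (step size) and momentum.
class WeightUpdateSketch {
    /**
     * Returns the updated weight.
     * change = gain * errorTerm * input + momentum * (previous change)
     * The applied change is stored in prevChange[0] for the next call.
     */
    static double update(double weight, double gain, double momentum,
                         double errorTerm, double input, double[] prevChange) {
        double change = gain * errorTerm * input + momentum * prevChange[0];
        prevChange[0] = change;
        return weight + change;
    }

    public static void main(String[] args) {
        double[] prev = {0.0};   // no previous change before the first update
        double w = 0.001;        // small initial weight
        w = update(w, 0.1, 0.1, 0.5, 1.0, prev);
        System.out.println("after first update:  " + w);
        w = update(w, 0.1, 0.1, 0.5, 1.0, prev);
        System.out.println("after second update: " + w);
    }
}
```

With a small error term the change is small, and with a large error term the weight moves further, which matches the behavior described in this section. The momentum term carries part of the previous change forward, which is where the smoothing effect comes from.
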
There were three different kinds of weights: initial weights, lower weights and upper weights. The initial weights were all randomly set to some small positive or negative number within the user entered initial weight range. The lower weights were the ones between the input and hidden nodes. The upper weights were between the hidden nodes and output nodes. The weights were changed by adding the gain and momentum terms. If the error for a particular node was low, the weight would not change much. If the error was high, the weight would go up or down as required to attempt to lower the error.

[|Back to Top]

F. How many points were required for network to learn the difference between types A and B?

I have to admit, I did not try a varying number of points. However, a large number of points is clearly needed because of the size of the square and the circle. The training used 1000 points and, considering that the correctness numbers were high during various tests, I'd say that 1000 was a reasonable number. With fewer points I would expect accuracy to be lost, and with more points training would become inefficient.

[|Back to Top]

G. After network was trained, what % of the time did it classify correctly?

Again, what does correctly mean? 99%? 100%? Few of the configurations classified 100% correctly. If you look at the images provided for gain and momentum, you'll see that even with wild instability in the sum squared error function, all of the systems achieved greater than 90% correct classification. So, instead of providing a single number for the percent of the time the network classified correctly, I ran the network 10 times to completion, meaning either convergence or the maximum of 100,000 epochs. I used what I consider to be the best parameters (0.1 for gain and 0.1 for momentum) and I provide the average correctness below.

Here are the results from 10 runs of the network. Each run's result was obtained by summing the results of 100 tests of 100 points each. The Sum is cumulative across runs and the percentage is the running average:

WeightRange: 0.0010, Gain: 0.1, Momentum: 0.1, MaxEpochs: 100000, Training Threshold: 0.1, Decision Threshold: 0.1

 * Run: 1, Sum: 9846, 98.46% average
 * Run: 2, Sum: 19652, 98.26%
 * Run: 3, Sum: 29412, 98.04%
 * Run: 4, Sum: 39197, 97.99%
 * Run: 5, Sum: 48987, 97.97%
 * Run: 6, Sum: 58749, 97.91%
 * Run: 7, Sum: 68567, 97.95%
 * Run: 8, Sum: 78377, 97.97%
 * Run: 9, Sum: 88160, 97.95%
 * Run: 10, Sum: 98002, 98.00%

Notice that the average correct started to go back up near the end. This gives me a warm fuzzy feeling that the network trained with these parameters does provide about 98% correctness on average.

[|Back to Top]

H. What decision criteria was used? How did you decide the network was classifying the input as A or B?

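The classification rule described in this section can be sketched as follows. The rule for A comes from the write-up (output A above 1 minus the threshold and output B below the threshold). The mirrored rule for B and the "undecided" result are assumptions added for illustration, and the class and method names are hypothetical.

```java
// Hypothetical sketch of the two-output decision rule with a Decision Threshold.
class DecisionRuleSketch {
    /** Returns "A", "B", or "undecided" for a pair of output activations. */
    static String classify(double outA, double outB, double threshold) {
        double high = 1.0 - threshold;           // e.g. 0.9 when threshold is 0.1
        if (outA > high && outB < threshold) {
            return "A";
        }
        if (outB > high && outA < threshold) {   // assumed mirror image of the rule for A
            return "B";
        }
        return "undecided";                      // assumed outcome when neither rule fires
    }

    public static void main(String[] args) {
        System.out.println(classify(0.95, 0.05, 0.1));
        System.out.println(classify(0.05, 0.95, 0.1));
        System.out.println(classify(0.60, 0.40, 0.1));
    }
}
```
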
The decision criteria were user adjustable and broken into two parts. The **Training Threshold** was the criterion that decided whether the network had achieved convergence. It was fun to set this to 15 and watch almost all the networks converge! The **Decision Threshold** was used to determine whether the network was classifying the input as A or B. The network has two outputs: one represents the A identifier and the other represents the B identifier. The default threshold for classification is set to 0.1 in the GUI. In order for an input to be classified as A, the output activation level for B would have to be less than 0.1 AND the output for A would have to be greater than 1 - 0.1 (or 0.9).

[|Back to Top]

References


 * [|Lippmann IEEE Article]
 * [|Susan Zabaronick Backpropagation assignment]
 * [|Susan Zabaronick algorithm page]
 * [|William Clements Backpropagation assignment]
 * [|Dorota Badiere Backpropagation assignment]
 * [|Backpropagation algorithm]
 * [|Wikipedia]
 * [|Momentum explained]
 * Referenced on momentum web page: Bishop, C. (1995) "Neural Networks for Pattern Recognition", Oxford University Press, Oxford, UK, pp.116-149.

= Neural Networks Project 5 =

 Date: 7/23/08
**Recognition of Highly Noisy Pattern**

Student: Michael Weingarden

Course ID: COMP572 Neural Networks

Semester/Year: Summer 2008

Title/Topic: Project 5, Binary Hopfield Network

 * [|Introduction]
 * [|The Code]
 * [|1. Determine Pattern Stability]
 * [|2. Attempt to Remedy Pattern Instability]
 * [|3. Test Network with Noisy Versions of Patterns]
 * [|Additional considerations]
 * [|References]

Introduction

[|Back to Top]

The Code

[|Eclipse project archive file hoppyMW.zip]

If you'd like to try running the application from the privacy of your own home, you can download this JAR file and run the program anytime. Here is the JAR file: [|Hopfield Binary Net application (hoppyMW.jar)]

Some notes on the code. The training wheels are off. I completed an entire project without borrowing significant amounts of code from anybody. However, I still borrowed algorithm ideas and I borrowed an entire class in order to integrate Swing (Java GUI) input and console output. I found it much simpler to implement network input and output visualization via the console, but I still wanted users (that's you) to be able to double-click an icon and run the app.

Application description. Download the JAR file and double-click it. Here is what you'll experience and what is happening behind the scenes:

 * a form appears that allows the user to enter mutation rate, convergence threshold and pattern number
 * four patterns (bipolar representations of digits 0 through 3) are loaded into the network during training
 * input and output patterns are displayed side by side in the Java output console
 * all four patterns are then run through the network a second time (a test for network stability)
 * again all input and output patterns are displayed
 * the user chosen pattern number is mutated by flipping bits based on the user entered mutation rate
 * the mutated pattern is run through the network
 * neurons are updated, net inputs are re-computed and the process continues until convergence
 * convergence is determined by the user entered convergence threshold
 * the convergence threshold dictates how many iterations the network output must remain the same before stopping
 * again all input and output patterns are displayed

[|Back to Top]

1. Determine Pattern Stability

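To make the stability test concrete, here is a minimal sketch of Hebbian training on bipolar patterns followed by a check that a stored pattern comes back unchanged after one update. The zero diagonal, the tie handling and all names are assumptions for illustration. The project's own training code may differ.

```java
import java.util.Arrays;

// Hypothetical sketch: train a small bipolar Hopfield network and test pattern stability.
class HopfieldStabilitySketch {
    /** Hebbian weights: w[i][j] is the sum over patterns of p[i]*p[j], with a zero diagonal (assumed). */
    static int[][] train(int[][] patterns) {
        int n = patterns[0].length;
        int[][] w = new int[n][n];
        for (int[] p : patterns) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (i != j) {
                        w[i][j] += p[i] * p[j];
                    }
                }
            }
        }
        return w;
    }

    /** One pass over all neurons: each takes the sign of its net input (a zero net input keeps the old value). */
    static int[] step(int[][] w, int[] state) {
        int n = state.length;
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            int net = 0;
            for (int j = 0; j < n; j++) {
                net += w[i][j] * state[j];
            }
            next[i] = net > 0 ? 1 : (net < 0 ? -1 : state[i]);
        }
        return next;
    }

    /** A pattern is stable if one update leaves it unchanged. */
    static boolean isStable(int[][] w, int[] pattern) {
        return Arrays.equals(step(w, pattern), pattern);
    }

    public static void main(String[] args) {
        int[][] patterns = { {1, 1, -1, -1}, {1, -1, 1, -1} };
        int[][] w = train(patterns);
        for (int[] p : patterns) {
            System.out.println(Arrays.toString(p) + " stable: " + isStable(w, p));
        }
    }
}
```

The two toy patterns share exactly half of their bits, so they do not interfere with each other. Patterns with more overlap are the ones that tend to fail this check.
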
I thought I was going to be tricky and go right to a stable network. Of course, I discovered that wasn't so easy. When you try to design 0, 1, 2, and 3 patterns, it is difficult not to have a lot of overlap. The more "bits" that two patterns have in common, the greater the chance of having instability. So, I'm including a link to 3 network runs so you can see the evolution of input patterns as I tried to move the network from unstable to stable:

 * [|Unstable input patterns 1] (notice how the network thinks a 3 is a 2)
 * [|Unstable input patterns 2] (here the network outputs a mutated 3)
 * [|Stable input patterns 1]

One note: these tests were run with only 1 iteration. This was before I figured out how to cause the network to iterate, so it's possible that the second "3" input pattern may have been recognized with more iterations.

[|Back to Top]

2. Attempt to Remedy Pattern Instability

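The layout idea discussed in this section (a one dimensional array written so that it reads like a two dimensional grid) and the weight counts it mentions can be sketched like this. The tiny 5x4 glyph, the class name and the helper methods are hypothetical stand-ins and are not the patterns used in the project.

```java
// Hypothetical sketch: a flat bipolar pattern laid out in source like a 2D glyph.
class PatternLayoutSketch {
    static final int ROWS = 5;
    static final int COLS = 4;   // tiny illustrative grid; the write-up uses 12x10

    // One dimensional array; whitespace alone makes it look like a grid.
    static final int[] ONE = {
        -1, -1,  1, -1,
        -1,  1,  1, -1,
        -1, -1,  1, -1,
        -1, -1,  1, -1,
        -1,  1,  1,  1,
    };

    /** Entries in a full n-by-n weight matrix, where n = rows * cols neurons. */
    static long weightCount(int rows, int cols) {
        long n = (long) rows * cols;
        return n * n;
    }

    /** Renders a flat pattern as text rows: '#' for +1 and '.' for -1. */
    static String render(int[] pattern, int cols) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pattern.length; i++) {
            sb.append(pattern[i] > 0 ? '#' : '.');
            if ((i + 1) % cols == 0) {
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(ONE, COLS));
        System.out.println("12x10 grid weights: " + weightCount(12, 10));
        System.out.println("50x50 grid weights: " + weightCount(50, 50));
    }
}
```

Because the network only ever sees the flat array, the training loop does not need to know about rows and columns at all. Only the display code does.
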
In my first attempt to create test patterns, I skimped on the number of bits that I used. I tried to make all the patterns skinny and, naturally, many bits overlapped between the patterns. So, as Dr. Wolfe says, I changed the patterns so that they are "more different." To do this, I made the 2 and the 3 wider, but not quite as wide as the 0. It was difficult to make the 2 and the 3 very different (yet still similar enough to be part of the same font set). For this reason, it was difficult to make a stable network with both the 2 and 3 patterns.

Also, while speaking of the pattern design process, I'd like to comment on approach. It took me several days to understand how to implement the network for a two dimensional pattern. When you look at the diagrams in the class notes, patterns are represented as one dimensional. Even that was a bit confusing because you have multiple patterns, each with multiple neurons, and connection strengths that go between every pair of neurons. I started browsing through prior student approaches and I was most pleased with Sarieva's approach. Here's a snippet of the code I used to create the patterns:

Since the Java compiler doesn't care about whitespace, I was able to make the grid into a one dimensional array and yet still arrange it in the source as a two dimensional array. This made it easy to try different designs to improve network stability, and it was still very easy to implement the training algorithm.

Dr. Wolfe suggested that you could use a larger grid in order to differentiate between patterns easily. However, the amount of processing required increases drastically as the grid grows. As it is, the 12x10 grid requires 120 neurons, and 120 neurons require nearly 14,400 weights to be calculated and stored. With a 50x50 grid (2,500 neurons), you would need 6,250,000 weights calculated and stored.

[|Back to Top]

3. Test Network with Noisy Versions of Patterns

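Introducing noise into a test pattern, as described in this section, might look like the sketch below. It assumes the mutation rate is applied as an independent flip probability per bit, which is only one possible reading of "mutation rate". The names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: mutate a bipolar pattern by flipping bits at a given rate.
class MutationSketch {
    /** Returns a copy of the pattern where each bit is flipped with probability rate (0.0 to 1.0). */
    static int[] mutate(int[] pattern, double rate, Random rng) {
        int[] out = pattern.clone();          // leave the stored pattern untouched
        for (int i = 0; i < out.length; i++) {
            if (rng.nextDouble() < rate) {
                out[i] = -out[i];             // flip +1 to -1 or -1 to +1
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pattern = {1, 1, -1, -1, 1, -1, 1, -1, 1, 1};
        int[] noisy = mutate(pattern, 0.20, new Random());
        System.out.println("original: " + Arrays.toString(pattern));
        System.out.println("mutated:  " + Arrays.toString(noisy));
    }
}
```

With a per-bit probability, a 20 percent rate flips about 20 percent of the bits on average, so individual runs can contain more or fewer flips.
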
The next step was to allow the user to introduce mutations into the test pattern. With my application, you can enter mutation rate, convergence threshold and test pattern number. The application will then behave as described in the [|The Code] section above. As a result, it is very easy to test all patterns (0 through 3) with different mutation levels (including 5, 10, 15 and 20 percent).

After running the tests many times, it was obvious that 0 and 1 were the most stable patterns, even with as much as a 20 percent mutation rate. However, 2 and 3 had more problems, even at lower mutation rates. And 3 was the worst of all, often being recognized as 2, sometimes even when the pattern was input with no mutation. I ran the network manually 10 times for each of the cells in the matrix below:

Patterns Recalled Correctly Test Matrix (rows: pattern number; columns: mutation rate in percent; cells: correct recalls out of 10 runs)

|| Pattern || 0 || 5 || 10 || 15 || 20 ||
|| 0 || 10 || 10 || 10 || 9 || 10 ||
|| 1 || 10 || 10 || 10 || 8 || 8* ||
|| 2 || 10* || 10* || 10* || 10* || 10* ||
|| 3 || 10* || 9* || 9* || 6* || 5* ||

* indicates mutated output on some or all runs

Here are a couple of sample runs of the network:

 * [|Input pattern 0, 20% mutation]
 * [|Input pattern 3, 5% mutation]

[|Back to Top]

Additional Considerations

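The convergence rule discussed in this section (stop once the output has stayed the same for a user chosen number of iterations) and the energy calculation can be sketched as below. The in-place neuron by neuron update, the sweep limit and all names are assumptions made for this example and are not taken from the project code.

```java
import java.util.Arrays;

// Hypothetical sketch: iterate a bipolar Hopfield network until the state stops changing.
class ConvergenceSketch {
    /** Updates neurons one at a time in place; returns true if any neuron changed. */
    static boolean sweep(int[][] w, int[] s) {
        boolean changed = false;
        for (int i = 0; i < s.length; i++) {
            int net = 0;
            for (int j = 0; j < s.length; j++) {
                net += w[i][j] * s[j];
            }
            int v = net > 0 ? 1 : (net < 0 ? -1 : s[i]);
            if (v != s[i]) {
                s[i] = v;
                changed = true;
            }
        }
        return changed;
    }

    /** Energy E = -1/2 * sum over i,j of w[i][j] * s[i] * s[j]. */
    static double energy(int[][] w, int[] s) {
        double sum = 0.0;
        for (int i = 0; i < s.length; i++) {
            for (int j = 0; j < s.length; j++) {
                sum += w[i][j] * s[i] * s[j];
            }
        }
        return -0.5 * sum;
    }

    /** Sweeps until the state is unchanged for `threshold` sweeps in a row, or maxSweeps is reached. */
    static int[] recall(int[][] w, int[] input, int threshold, int maxSweeps) {
        int[] s = input.clone();
        int unchanged = 0;
        for (int k = 0; k < maxSweeps && unchanged < threshold; k++) {
            unchanged = sweep(w, s) ? 0 : unchanged + 1;
        }
        return s;
    }

    public static void main(String[] args) {
        int[][] w = { {0, 0, 0, -2}, {0, 0, -2, 0}, {0, -2, 0, 0}, {-2, 0, 0, 0} };
        int[] input = {1, 1, -1, 1};
        int[] out = recall(w, input, 2, 100);
        System.out.println(Arrays.toString(out) + " energy " + energy(w, out));
    }
}
```

The energy is reported only for inspection here. An energy based stopping rule would compare successive energy values instead of comparing successive states.
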
 * One of the things that stumped me about understanding the Hopfield algorithm was confusion over the symbols used in the formulas. It took me a couple of days and lots of cross referencing between various papers (and prior student code) before I realized that a[i] and x[i] and neurons were all the same thing. In the Rojas book referenced below, he consistently refers to x[i] and that helped.
 * I also had a hard time differentiating between net[i] and neurons. This was because in previous projects, we had input neurons.
 * I found it very difficult to conceptualize the calculation of the two dimensional weight matrix for a two dimensional neuron matrix, so I treated all the patterns as one dimensional instead of two.
 * It took me a while to figure out what value there was in iterating through the network and how to implement an iterative loop for recognizing a pattern. Once I did figure things out, I spent some time pondering what constituted convergence. Is convergence when the output pattern matches one of the four digit patterns? Is it when the output stops changing? If it's when the output stops changing, how many times must the output be the same before you believe you have convergence?
 * I finally figured out that convergence meant that the output remains the same and that the user would control how many times the output must remain the same before the program stops iterating.
 * I also calculated the energy and had plans to use energy as a way to determine when convergence was reached, but I realized that convergence was usually reached after 10 or fewer iterations, so I didn't bother implementing an energy based threshold for convergence.
 * One last note: after reading the algorithm as described in the project description, I realized that it would be much simpler to understand written in bullet form. So, I did re-write it and here's a link to what I call the plan: [|The Plan]

[|Back to Top]

References


 * [|Rojas, Raul, Excellent chapter on Hopfield (PDF)]
 * [|Rojas, Raul, Entire book on neural nets free]
 * [|Lippmann (PDF)]
 * [|Wikipedia]
 * [|Simple Java Console, RJHM van den Bergh]
 * [|Sarieva Hopfield Assignment]
 * [|Zabaronick]

= Neural Networks Project 6 =

 Date: 7/30/08
**Self Organizing Map GUI App**

Student: Michael Weingarden

Course ID: COMP572 Neural Networks

Semester/Year: Summer 2008

Title/Topic: Project 6, Kohonen Self-Organizing Map

 * [|Introduction]
 * [|The Code]
 * [|Implementation Issues]
 * [|References]

Introduction

I'm running a little late. I'm posting the code below, but I still have to do the write up tomorrow.

[|Back to Top]

The Code

[|Eclipse project archive file somTSP1.zip]

If you'd like to try running the application from the privacy of your own home, you can download this JAR file and run the program anytime. Here is the JAR file: [|Kohonen SOM TSP application (somTSP1.jar)]

If you click on the link to the JAR file above, you can run the application directly using Java Web Start. You'll see that the application produces very good TSP tours very quickly.

Application description. Download the JAR file and double-click it. Here is what you'll experience and what is happening behind the scenes:

In its current state, the application creates a graphical map of 30 cities and a tour designed using the Kohonen Self-Organizing Map algorithm. Each time the application runs, 30 new cities are randomly generated. However, during any one invocation, the cities remain static. There are two buttons and both of them cause the SOM algorithm to run and a new tour (of the existing cities) to be displayed.

I tried several approaches when designing the GUI. At first, I had two separate sets of randomly generated cities, but that didn't serve much of a purpose. Then, I tried to implement an application that would use the same set of cities to create tours by SOM and by Genetic Algorithm, or by SOM and by the first project (the Hopfield-Tank TSP). I could get two algorithms side by side, but I couldn't get the non-SOM algorithms to go through iterations. Because of this, the non-SOM algorithms performed pitifully compared to the SOM generated tour (on the same cities of course). Also, I had a problem getting the non-SOM algorithm displays to show the correct length. All of the things I tried are evidenced in the source code in the zip file above. You can see the GAtsp.java file and the HopTSP.java file and, in the Form1.java file, you can see where the panel classes were created that would pull data in from these two classes (GAtsp and HopTSP).

The intent of the "Run again" button was to allow one algorithm to be rerun. The intent of the "Let the games begin..." button was to start two algorithms on one set of cities and to see the results side by side.

Disclaimer: much of the code that I used came from former students Sheel and Sarieva. I will discuss that more in the section relating to implementation and interpretation of the algorithm.

[|Back to Top]

Implementation Issues

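One reading of the "interaction distance" parameter discussed in this section is that it sets how many ring neighbors on each side of the winning neuron are pulled toward the current city. The sketch below illustrates that reading only. The falloff of the gain with ring distance, the parameter names and the class are hypothetical, and this is not the algorithm as published or as implemented in the project.

```java
// Hypothetical sketch: one SOM update on a ring of neurons for a TSP tour.
class RingSomSketch {
    /** Index of the neuron closest to the city (squared Euclidean distance). */
    static int winner(double[][] neurons, double[] city) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < neurons.length; k++) {
            double dx = city[0] - neurons[k][0];
            double dy = city[1] - neurons[k][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    /**
     * Moves the winner and its neighbors within `radius` ring positions toward the city.
     * Assumes 2 * radius + 1 does not exceed the number of neurons.
     */
    static void update(double[][] neurons, double[] city, int radius, double gain) {
        int n = neurons.length;
        int win = winner(neurons, city);
        for (int d = -radius; d <= radius; d++) {
            int k = ((win + d) % n + n) % n;              // wrap around the ring
            double strength = gain / (1 + Math.abs(d));   // assumed falloff with ring distance
            neurons[k][0] += strength * (city[0] - neurons[k][0]);
            neurons[k][1] += strength * (city[1] - neurons[k][1]);
        }
    }

    public static void main(String[] args) {
        double[][] ring = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };
        update(ring, new double[] {0.2, 0.0}, 1, 0.5);
        for (double[] p : ring) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

Shrinking the radius over time would make later updates more local, which is a common design choice for this kind of map but is stated here only as a general observation.
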
"Explain any variation you made from the method as it is described above, and how you resolved any ambiguities, and how you interpreted the parameter settings. Also provide sketches of the resulting tours. You should run the simulation several times and report statistical results."

I have to admit that after reading and re-reading the algorithm as described by Favata and Walker (and the normalization process as described in our syllabus), I had a difficult time understanding certain network parameters. The biggest problem that I recall was the parameter called "interaction distance." I wasn't totally sure what this meant. I kept re-reading the Favata paper, but it wasn't becoming clear. So, I searched around on the internet and came up with a diagram that was helpful:

This helped me to understand that the interaction distance was "i", but I still wasn't sure how "i" related to the algorithm. So, I dove into former students Sheel and Sarieva's code. Even there, I struggled to figure out what "i" was doing. Even though I'm still not entirely sure, I believe that "i" is used to determine how many neurons to the right and left of the current neuron have their weights (connection strengths) updated. To me, it's definitely not straightforward.

Speaking of which, I did take the time to re-write the algorithm in English to try to understand the intent a little better. Here is a link to the algorithm in English as I interpreted it: [|Algorithm description re-written]

[|Back to Top]

References


 * [|Sarieva Hopfield Assignment]