Jim Susinno Maneesh Dewan 600.475 Machine Learning (Fall 2002) EVALUATION CRITERIA FOR REPORT: * Statement of problem (15%) * Approach (20%) * Background and related work in field (15%) * Results (15%) * Discussion and analysis of results (20%) * Relevant future work (5%) * Construction and readability of report (10%) background and discussion of previous work done related to the topic, ------------------------------------------------------- a clear description of the problem to be solved, ------------------------------------------------------- discussion of the approach taken ------------------------------------------------------- Our first step was to implement the neural nets. We used a two-headed approach, replicating our experiment in C++ and in Matlab. The general algorithm is presented in algorithm 1. To initially test the neural networks, we employed simple classification examples such as xor. Observing the training sessions allowed us to make sure that our implementation was solid enough to use as a foundation to build upon. Now confident in our neural nets' abilities, we proceeded to a simple image example. We used hand drawn images of chevrons(v^) as training input to the network to see if we could learn the upward-pointing pattern. For computational efficiency, we restrict ourselves to images of size 20x20, and networks with 20 hidden units, 20x20x20 == 8000 weights in the first layer of the net. We used OpenGL to display the connection weights of the network as red and green lines in 3d space, allowing us to confirm visually if our network had learned the image presented. In positive cases, it is possible to visually see the input left by the training images in the network. This image bears a striking resemblance to the input images if the pattern is learned correctly. The next step in perfecting the learning process was the addition of negative training examples. Our cursory approach was to add a number of random images to a negative test set, consisting only of noise and expecting an output of 0.0(not face). We quickly found that such images do not benefit the learning process at all, due to the random structureless nature of the images. A network presented with a set of positive inputs did no worse than one with the same positive inputs in addition to some random examples. Our next approach was to add a number of all black images to the training set. This approach proved much more beneficial, and clearly provided a benefit to the network's ability to classify. The flat black image provided a baseline for the network to compare to when classifying. Intuitively, this makes sense: a black image will "quiet" all input weights in the network equally, in contrast to the face images, which excite ONLY the critical face weights. Training on a random image would have an effect very similar to just adding a random component to all the weights, adding no structure to the classifier. --- tweaking inputs --- Now that our classifiers ability to learn images has been proven, we focus on faces specifically. Using Matlab, we carefully filtered and equalized our input images from datasets obtained online. We also implemented a highly simplified method of choosing negative training examples(non-faces) in which we hand-cropped non-face portions of some images we had, resized them to 20x20, performed the same equalization steps and saved them to a negative training folder. The face detection system is now fully functional. Using datasets obtained online from AT&T amd CMU, we created a custom, equalized set of 386 training and test images. We use this set as a baseline for evaluating different network parameters and their effectiveness in learning face patterns. The first parameter to vary is the number of hidden layers. The trivial case of zero provides a classifier of minimal classification ability. Adding a hidden layer of size 10 increases accuracy no more than the addition of bias in the 0 hidden layer case. A layer of size 20 brings error to 0.39, 0.49, slightly lower. in depth discussion of any algorithms used or developed ------------------------------------------------------- The backpropagation algorithm was central to the project's operation. ------------------------- Algorithm 1 ------------------------ Backpropagation was implemented from the equation on p.98 of Tom Mitchell's _Machine Learning_: for each in training examples do: input instance x to the network and compute the output o of every unit u in the network for each network output unit k, calculate its error term delta_k delta_k <- o_k(i - o_k)(t_k - o_k) for each hidden unit h, calculate its error term delta_h delta_h <- o_h(1 - o_h) * Sigma(k1 is presented as input to the network. An output vector of a fixed size m is calculated by summing the contributions of all the input values multiplied by their corresponding weights. This process is repeated over all the layers of the neural network, feeding the output of the upstream layer to the input of the downstream layer. Every layer consists of a nxm matrix of weight values. Backprop relies on the delta-rule learning algorithm for neural nets. Delta rule is conceptually a very simple algorithm for learning in neural networks. Essentially, the delta learning rule corrects a neural network's weights so the next time it is presented with a particular example(for which the correct classification is known), its output will be closer to that known correct output, which is also presented with the example. It accomplishes this by crossing the target output vector with the input vector(this is the one layer case) to obtain a matrix that would correctly calculate the target output from the input. It then adds these resulting "delta" weights into the network's weight matrix(scaled by a coefficient eta, or learning rate). A small eta(0.1-0.5) is preferable so that the network might find a good compromise of weights among all of its training examples without overlearning the last presented example, producing a thrashing effect. In the multi-layer network case where we are dealing with hidden layers, delta rule is not as straightforward. In particular, it is difficult to determine what the values of the hidden nodes should be to produce a desired output. In fact, they could be any number of values and the network could still feasibly learn the pattern. This is the nature of the neural net's learning ability, to abstract hidden details of input patterns into hidden nodes. These details are often not perceivable by humans, but they are nonetheless effective means of lerning how to classify inputs. For any given set of input-output pairs, the neural net can (ideally) learn the aspects of a positive set that distinguish it from the negative set by adjusting its weights according to error gradient - and can do so with an entirely different set of hidden weights every time. This is where the notion of backprop becomes essential. In order to determine a reasonable expected value for the hidden layer vector, we take the target output vector and reverse-propagate(hence the name) its values through the weight matrices of the network. This operation is not reversible; an output vector produced by the matrix can not be reverse propagated through the net to recover the original input vector. Given this fact, it may seem unreasonable to expect that a neural network with hidden layers could learn anything at all, but in practice it is possible. It is however a very delicate process! Often times, a neural net's error must go up before it can come down. For some of the simpler examples viewed graphically, we can observe this phenomenon occurring. For instance, we can watch the xor network learn increasing negative weights for its first layer, and its error rate will remain high. But once the network "figures out"(and this is a largely random process) a set of weights that works for xor(which is -- +- -+ ++ as input weights to the 4 hidden nodes), the error begins to drop dramatically and approaches zero. This aspect of neural nets' learning makes a training run heavily dependent on the initial seeding of the weight matrices. The most common approach is the one we adopted, using small random values from 0.0 to 0.5. Varying this parameter is another possibility of system parameterization, one which was not explored due to time constraints. Extra special thanks to TA Bodhisattva Debnath for helping us outside of class to grasp these principles. detailed presentation of the results obtained ------------------------------------------------------- While watching the image classifying network learn in real-time, certain interesting aspects of neural networks became apparent. First of all, they are inherently unstable. The very same neural network under similar(random) initial conditions can sometimes learn and sometimes not learn a particular pattern correctly. Sometimes the network will converge to a training set error of 0.5 and have a very difficult time getting below that. We believe that this effect can be reduced with the addition of bias to the network, but as a temporary measure we introduced the ability to "jostle" the network by adding a small random value to every weight in the network. Another interesting characteristic of these neural nets is that they have the potential to learn the face pattern in either of 2 polarities, positive or negative. It is as yet undetermined whether one polarity can represent patterns more accurately. discussion of the importance and implications of the results ------------------------------------------------------- the implications are that it is entirely possible for a 2 layer neural network to learn the pattern of a generic human face of size 20x20. These results indicate that a face detection system is highly possible and can be very efficiently implemented and deployed on today's hardware. The results also indicate thet parameter choice for neural networks can have profound effects on a network's ability to classify. The dimension of this parameter space is high, implying that the amount of research that could be performed on such networks is vast. An important fact to taske into consideration when analyzing the results is that a test run is not guaranteed to converge. It is possible that a run will not achieve convergence when in fact it was possible with a different random seed. directions for future work ------------------------------------------------------- - explore different seeding techniques - random threshold != 0.5, maybe something else... - explore convergence behavior of networks, perform n random seedings and count the number of times a network reaches convergence - perfect the annealing technique; explore heuristics for most effective learning - perform an exhaustive analysis of hidden layer sizes, build a wrapper program to iterate - detection of images other than faces, characters, hands, flags, etc. - higher resolution images - online processing of input images