
Fishing Nets

Neural Networks &


Convolutional Networks

Colin Togashi
Meng-Hao Li
Jack Shue
Gabriel Fernandez
Adaptation & Learning, University of California, Los Angeles
Professor A. H. Sayed, UCLA, email: sayed@ucla.edu
This project was done for UCLA’s Electrical Engineering Department’s Adaptation &
Learning class under the supervision of Dr. Ali Sayed. If made public, photos need to be
purchased for rights.
First release, March 2017
Contents

1 Introduction
1.1 Motivation
1.1.1 Kaggle
1.2 Objective
1.3 General Approach
1.3.1 Assumptions
1.3.2 References

2 Theory Overview
2.1 Neural Networks
2.1.1 Perceptrons
2.1.2 Network Training
2.2 Convolutional Networks
2.2.1 Masks

3 Initial Approaches
3.1 Neural Networks
3.1.1 Gaussian Distribution Test
3.1.2 Step size µ

4 Convolutional Neural Net
4.1 Weaving Nets
4.2 Vertical Horizontal Line Test
4.3 Icons Recognition Test
4.4 Softmax Vs Cross-Entropy
4.5 Numerical Checking

5 In Practice
5.1 Fish & Datasets
5.1.1 Curse of Dimensionality
5.2 Weaving Nets
5.3 Challenges
5.3.1 Computational Limits
5.4 Compromises
5.5 Algorithm Adjustments
5.5.1 Architecture 1
5.5.2 Architecture 2
5.6 Network Architecture

6 Results & Thoughts
6.1 Results
6.2 Thoughts
6.3 Future work
6.3.1 Spatial Pyramid Pooling & Bagging and Boosting
6.3.2 Competition

7 Feedback & Experience
7.1 Each Member's Contribution
7.2 Handout Feedback
7.2.1 Neural Network
7.2.2 Convolutional Neural Network
7.2.3 General Feedback
7.2.4 Typos
7.3 How Much Did We Learn?

8 Bibliography
8.1 References
1. Introduction

1.1 Motivation

Illegal, unreported, and unregulated fishing practices account for nearly 60% [1] of
all tuna caught around the world. If this trend continues, the half of the world's population
that depends on seafood may be in danger, as current fishing practices threaten to destroy the
earth's fragile marine ecosystem. The Nature Conservancy aims to use technology to
preserve fisheries and protect nature for future generations [2].
With advancements in image recognition, The Nature Conservancy wants to start
deploying cameras to monitor fishing activity and increase compliance by filling in
for underreported catches. The hardware and electronics are fully prepared for mass
deployment; however, the cost of processing the massive amounts of data is prohibitive.
The Nature Conservancy is reaching out to the data science community to stem the cost
by implementing image processing and classification. With an algorithm that can correctly
identify the fish in each picture, countries will be able to redirect resources to address issues
affecting marine life. Machine learning will help us learn more about our marine life as
well as maintain a healthy balance in its ecosystem.

1.1.1 Kaggle

This problem comes from Kaggle, a machine learning competition database. These are real-world
problems being solved by algorithms such as the one we will be discussing. One
of the harder parts about theory is putting it into practice, as you will see in this report.
Professionals in the field enter Kaggle competitions, so we have our work cut out for us. If
you are interested in entering a competition, follow this link: https://www.kaggle.com/

1.2 Objective
The Nature Conservancy will provide a limited dataset of the type of photos that will be
taken on board boats. It will be our job to develop an algorithm to predict the likelihood of
fish species in each picture. There are eight target categories available in the dataset:
Albacore tuna, Bigeye tuna, Yellowfin tuna, Mahi Mahi, Opah, Sharks, Other (fish present
but none in the above categories), and No Fish (meaning that there are no fish in the image). Each
image in the given dataset has only one fish category, although there may be more than one
fish in the picture.

1.3 General Approach


The basic premise of the algorithm is to first get rid of as much unnecessary information
as possible and then classify the image. The training, testing, and competition images
supplied by The Nature Conservancy are very large, with only a small portion being
relevant to actual classification.

We initially chose Neural Networks as a group topic to tackle this issue. However,
upon further reading and consultation with Dr. Ali Sayed, our group decided that to
implement an effective algorithm, we would also need to incorporate Convolutional Neural
Networks, hence the title. The largest advantage of convolutional neural networks
is their ability to find relevant feature vectors from raw data. In comparison,
neural networks require feature vectors as inputs. Granted, a single pixel could be used
as a feature; however, convolutional networks also have the advantage of using spatial
correlation in the image to reduce the total number of weights in the network. With the
dataset full of noise and numerous variations, figuring out proper feature vectors
ourselves seemed ineffective.

We decided to use one convolutional neural network to determine whether there is a fish
in the image. If a fish exists, the network finds where it is and crops that region out.
We then use a second convolutional neural network to classify what type of fish is present
in the cropped image. This seemed like the best technique to address the noise and complexity
issues. Figure 1.1 shows an overview of this process.

1.3.1 Assumptions
This report is based on the assumption that the audience is familiar with the basic
concepts of neural networks and convolutional networks. We will only be focusing on
some of the basics, analogies, and smaller revelations that we have experienced throughout
lectures and our own research outside of class. Our purpose isn't to reteach the material; it
is to give an overview of some of the many things we've learned and to provide insight into
approaches based on what we've found in our own experiences. For more information please
refer to Dr. Ali Sayed's book: Adaptation, Learning, and Optimization over Networks,
Foundations and Trends in Machine Learning.

Figure 1.1: Simplified diagram of how the algorithm works, with the categories of the types of fish
being identified. A raw image is fed through the 1st CNN to find only the fish segment of
the image. Then, the segmented fish is fed through a 2nd CNN to classify the type of fish.
This method reduces the complexity of the problem that each CNN must solve. [3], [4]

1.3.2 References
Different references will be listed throughout the report; to see the specific references
please refer to the bibliography section. We want to preface, however, that the
vast majority of the report is based on Dr. Ali Sayed's books and notes. Given the large
number of references, it is best to assume that anything that doesn't have a direct footnote
has come from Dr. Ali Sayed.

R All the algorithms were built, tested, and analyzed using Matlab, as per request from
Dr. Ali Sayed. The way we talk about the algorithms will be framed with this in
mind. For more information you can visit their website: https://www.mathworks.
com/

R The code that this report is based on will be placed in a separate file from this report.

R © 2017. All Rights Reserved. No part of unreferenced sections of this report can
be reproduced, posted, or redistributed without written consent from Professor A. H.
Sayed (UCLA, email: sayed@ucla.edu).
2. Theory Overview

2.1 Neural Networks


Our group had originally signed up for only neural networks, but with the complicated
task at hand we decided to also take on convolutional neural networks. Since neural networks
lay the foundation for convolutional neural networks, it makes sense to cover them first. This
section only includes ideas that we have read through; in the actual implementation of
the algorithms we will see that the complexity grows and the ideal cases don't always work.

Neural networks are based on the idea of neurons in our brain. The overarching idea
is that our brain can take in any sort of input of information and learn how to interpret the
data. Let's take for example a freshly baked raspberry pie. Our brain has several different
sensors to interpret it. Using only our eyes, we may notice the heat and the color and tell
that it is indeed a freshly baked raspberry pie. Our nose is a different sensor, a different data
input; the smell alone might lead to the same conclusion. Moreover, the brain can learn,
so that the next time you see or smell a raspberry pie, you are more certain whether or not it
is a raspberry pie.

2.1.1 Perceptrons
The brain is not fully understood, but presently the scientific community has enough
knowledge to recreate part of the brain's functionality through neurons. Neurons make up the
brain and are all interconnected. They have inputs, and after a large enough impulse a
neuron fires out a signal to all the neurons connected to it. Each of these neurons can be viewed
as a perceptron.
In the algorithm, each neuron is represented by a perceptron. The perceptron
receives an input, and its output signal is essentially a linear combination of the input
with given weights, mapping the input to the output:

$$z = h^{\mathsf{T}} w - \theta \qquad (2.1)$$

Equation 2.1, at a very basic view, represents the underlying simplicity of neural networks.
z is just a linear combination of the received inputs, $h^{\mathsf{T}} w$, with θ representing a bias. After going
through this linear mapping, z is fed through a nonlinearity called the activation function.
Using the biological analogy again, the activation function represents the excitement
threshold a neuron needs to reach before firing its own signal. Other arguments have
been made for different types of activation functions and why they're needed at all.
One interesting argument is that the world is nonlinear and therefore it is useful to
introduce some nonlinearity into the mapping. This seems like a plausible argument.
One of the hardest parts of networks is generalization to datasets that the network hasn't
trained on before. A purely weighted linear combination would perform well on data it has
seen before, but once unfamiliar, different data is introduced, the likelihood that it
will do well is low. Thus, by introducing nonlinearities the mapping is no longer linear,
which helps with generalization to data the network has not seen before. Below is a
perceptron with one common type of activation function called a sigmoid function.

Figure 2.1: A biological representation of the neuron and a perceptron. As can be seen,
the perceptron aims to mimic the function of a neuron by sharing a similar topographical
structure as the neuron.[5]
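To make equation (2.1) and the activation concrete, here is a minimal Matlab sketch of a single perceptron with a sigmoid nonlinearity; the numerical values of h, w, and θ below are purely illustrative and not taken from any of the report's networks.

% Minimal sketch of a single perceptron (Equation 2.1) with a sigmoid
% activation. The feature vector h, weights w, and bias theta are
% illustrative values only.
h     = [0.5; -1.2; 0.3];          % feature vector (input)
w     = [0.8;  0.1; -0.4];         % combination weights
theta = 0.2;                       % bias

z = h' * w - theta;                % linear combination (Eq. 2.1)
y = 1 / (1 + exp(-z));             % sigmoid activation maps z into (0,1)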

The full neural network comprises numerous of these perceptrons interconnected.
On one end there are inputs, called feature vectors, which are a representation of the data.
As the inputs propagate through, a set of weights maps one spatial representation to
another. At the very end, after going through the neural net, there is a final layer with
a final activation function that allows for some interpretation. In our case, the last layer
may tell us whether there is a fish or not for the first convolutional network. In the second
convolutional network it could be interpreted as the likelihood that the fish is one of the
eight categories given by The Nature Conservancy. This is all under the assumption that
the neural network has the right weighting factors to map one high-dimensional space to
another correctly.
Figure 2.2 depicts exactly how the linear mappings of perceptrons interact with each
other. h, the input, is the feature vector or numerical representation of the data. Each yellow circle
represents a perceptron. Many constructions of these networks can have
hundreds if not thousands of these nodes per layer.

Figure 2.2: Interconnected perceptrons with input and output

2.1.2 Network Training


Now the big question becomes how do you get the proper weights to map from
one space to another. This is where training comes in. The idea is that there is some
cost function that will help us evaluate performance. We will consider a cost function with
regularization in terms of the Frobenius norm and a least-squares term in terms of the actual
labeling, γ.

A useful property of this type of cost function is that its empirical form approaches
the true cost function given enough samples. Such a cost function is said to have
ergodic properties, which is very useful. By reducing the error between the true labeling, γ,
and the classification, y, we can find a more reliable mapping with better weights.

The regularization term given by the squared Frobenius norm plays a vital role in
reducing over-fitting and improving generalization. There are numerous other types of cost
functions with different regularizing methods, as well as different error functions, that
will be discussed later in the report. The main idea here is to find the weights that
minimize this cost, hence providing valid weights to map from the feature-vector space to
the actual labeling space. The bias, θ, also plays an important role when trying to minimize
the cost function.
$$J_{\text{emp}}(W,\theta) \;\triangleq\; \sum_{l=1}^{L-1} \rho\,\|W_l\|_F^2 \;+\; \frac{1}{N}\sum_{n=0}^{N-1} \|\gamma_n - y_n\|^2 \qquad (2.2)$$

There are many different algorithmic approaches to train the network properly and arrive
at the best weights and biases. The main idea is to algorithmically descend the cost
function with stochastic-gradient descent; in other words, finding the gradient at a certain
point and moving opposite to it. Further on in the report we will delve into
more detail and share our experiences and the approaches that were applicable to solving the
fish classification problem.

$$w_n = w_{n-1} + \mu\left(-B\,\nabla_{w^{*}} J(w_{n-1})\right) \qquad (2.3)$$

$$\theta_n = \theta_{n-1} + \mu\left(-B\,\nabla_{\theta} J(w_{n-1})\right) \qquad (2.4)$$

Equations 2.3 and 2.4 show the recursive gradient-descent algorithm. This will be used to
descend the cost function to find the weights and bias. The recursion describes the
descent of the cost function in terms of its gradient, a constant B, the step size µ, and the
argument with respect to which the gradient is taken. The same recursion can be used for the
bias with little change to the algorithm. Whenever we refer to stochastic gradient descent,
we are referring to gradient descent.
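As an illustration of the recursions in (2.3) and (2.4), the following minimal Matlab sketch descends a toy scalar least-squares cost; B is taken as the identity and the single data pair (h, γ) is made up, so this is not the report's actual training code.

% Minimal sketch of the gradient-descent recursions (2.3)-(2.4) on a toy
% scalar least-squares cost J = (gamma - y)^2, with y = h*w - theta and B = I.
mu    = 0.01;                       % step size
w     = 0;  theta = 0;              % initial weight and bias
h     = 1.5;  gamma = 2.0;          % one (feature, label) pair (illustrative)

for n = 1:100
    y      = h*w - theta;           % current prediction
    gradw  = -2*h*(gamma - y);      % dJ/dw
    gradth =  2*(gamma - y);        % dJ/dtheta
    w      = w     - mu*gradw;      % move opposite to the gradient
    theta  = theta - mu*gradth;
end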

2.2 Convolutional Networks


We have generally covered how neural networks work and how to construct them, but
there is one important part missing: the feature vectors, h. Earlier we used an analogy of the
human eyes and ears being inputs for the brain to learn from. These inputs take the images and
sounds that we experience and transfer them into the signal representation we constantly
referred to in Section 2.1.
The follow-up question then should be: how do I convert my input into a readable
signal for the neural network? This is where convolutional networks have an advantage.
Take an image, for example, as in our case. Again using the raspberry pie example, when we
use our eyes as the input device for our brains, how do we know it's a raspberry pie? We
could probably tell by color, shape, and texture. How can we break a picture up into these
components? We need a tool that can capture spatial and color properties relative to the
surroundings. One powerful tool is a mask, also called a kernel.

2.2.1 Masks
The idea of a mask is that you take a pixel and the corresponding pixels around it and
convolve them with a matrix of certain values, the mask. By doing this, certain features become
more prominent. For example, the color of the raspberries in the raspberry pie example becomes
more noticeable, or the circular shape becomes more apparent. The term convolution comes from
convolving the mask with each pixel and its surrounding pixels. What we
mean when we say convolve is akin to correlate. This can take into account the spatial relations
of certain colors and edges. Images with similar correlation patterns after masking should
come from similar original images. Let's take a look at a pixel patch, H, and mask, W.
   
$$\mathcal{H} = \begin{bmatrix} \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix}, \qquad
\mathcal{W} = \begin{bmatrix} \times & \times & \times \\ \times & \boxed{\times} & \times \\ \times & \times & \times \end{bmatrix} \qquad (2.5)$$

The boxed element in W represents the pixel on which we center the mask. We could
increase the length and width of the mask and pixel area, depending on the problem and
computational power.

Figure 2.3: An edge-detection mask overlaid on a fish image to emphasize the shape of the fish [6]

Looking at Figure 2.3 we see that with a certain mask type, in this case an edge-detection
one, the edges are much more visible and much of the unnecessary information
is reduced to zeros. This can greatly help reduce the noise and assist in creating a
good set of features.
$$\mathrm{corr}(\mathcal{H},\mathcal{W}) \;\triangleq\; \sum_{k=-K}^{K}\;\sum_{l=-K}^{K} \mathcal{H}(k,l)\,\mathcal{W}(k,l) \qquad (2.6)$$
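As a minimal Matlab sketch of equation (2.6) applied over a whole grayscale image, the snippet below correlates a 3x3 edge-detection mask with every pixel; the image here is random data standing in for a fish photo, and the Laplacian-style mask is one common choice, not necessarily the one used later in the report.

% Minimal sketch of Eq. (2.6): correlate a 3x3 edge-detection mask with a
% grayscale image. filter2 performs 2-D correlation directly.
I = rand(64, 64);                    % stand-in grayscale image in [0,1]
W = [ 0 -1  0;                       % simple Laplacian-style edge mask
     -1  4 -1;
      0 -1  0];

E = filter2(W, I, 'same');           % correlation of mask and image
% Pixels where intensity changes sharply get large |E|; flat regions go to ~0.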

When doing this for different masks or colors, we end up with different features of the
image becoming much more prominent. As in the neural network section, we want to create
a linear mapping to a feature map and introduce nonlinearities into the process for the same
reasons discussed earlier.

Figure 2.4: (Left) RGB channel filters mapped to a filtered image. (Right) Vector representation.

In Figure 2.4 we see that the construction is very similar to neural networks. These
feature vectors can then be further compressed with pooling. Just as in neural networks, we
want to find the best weights and biases that reduce a cost function that we define. The
way we go about finding these is a similar process of finding the cost function's derivative and
moving in the opposite direction. By using essentially the same method, we can stochastically
descend the cost function until we reach a minimum that would, in theory, reduce the amount
of error while still being able to generalize. The result of this whole process is a set of
input features that is representative of whatever is being classified.

R Again we want to reiterate that this is an overview of what is happening in neural
networks and convolutional neural networks. This report is based more on interesting
tidbits we discovered through our learning experiences. This is not meant to teach
lecture material. Please read Dr. Ali Sayed's book for more.
3. Initial Approaches

3.1 Neural Networks


Just like fish among sharks trying to stay alive, we too were concerned with the most
basic necessities at first: getting the neural network to work. The problem we were
trying to tackle was vastly complex and needed to be broken down into bite-size pieces that
we could verifiably check for correctness.
As mentioned in the previous chapter, the fundamentals all depend on getting the neural
network correct. We developed a toy example where we could generate data and
process it quickly for a quick sanity check. We decided on coding two separate Gaussian
distributions with an adjustable standard deviation, centered in two different areas of
the R² plane.

3.1.1 Gaussian Distribution Test

For sanity checks and initial testing we used a very simple, self-generated test with
two Gaussian distributions. Here we can generate as many labeled data examples as we want
and play around with this toy example to gather an intuition for what would happen with certain
parameters. This also reduces the problem to the R² plane, allowing for relatively fast
computation. Here we could also do some analysis to gain an understanding of how to
set up the more complex algorithms.
In the following graphs and analysis of the toy Gaussian distribution example we use
2000 training data points. This example is set up exactly as it was laid out in Dr. Ali
Sayed's notes, with implementations of stochastic-gradient training, cross-entropy training,
and softmax training specifically for neural networks. We also examine the learning curve,
and the testing data is then classified by the trained neural network.
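A minimal Matlab sketch of how such a toy dataset can be generated is shown below; the class centers, spread, and ±1 labels are illustrative choices, not the exact values used in our experiments.

% Minimal sketch of the toy example: two Gaussian classes in the R^2 plane
% with an adjustable spread. Means, spread, and label encoding are assumptions.
N     = 2000;                          % training points (as in the text)
sigma = 0.5;                           % adjustable standard deviation
mu1   = [ 1;  1];  mu2 = [-1; -1];     % class centers (illustrative)

H     = [repmat(mu1, 1, N/2) + sigma*randn(2, N/2), ...
         repmat(mu2, 1, N/2) + sigma*randn(2, N/2)];   % 2 x N feature matrix
gamma = [ones(1, N/2), -ones(1, N/2)];                 % +1 / -1 class labels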

Figure 3.1: This figure depicts the two separate Gaussian distributions in the plane. We
moved them closer together and farther apart to see how accurately our classifier would work.

3.1.2 Step size µ


One of the most important aspects of networks is the learning algorithm. As mentioned
previously, we learn with stochastic gradient descent. The descent learns by
reducing the cost function. There is a coefficient that determines how much we should learn
from new data relative to what was already learned; that coefficient is called the step size, µ.
It is one of the most crucial parameters to tune correctly. A good step size can result in quick,
convergent learning, ultimately allowing us to classify the data correctly, while a poor µ could
result in bad learning and misclassification. We keep our step size, µ, constant so that our
algorithm is always learning from the constant stream of data.

In the following Figure 3.2, the graph on the right shows the Gaussian
distributions as red and blue crosses. It also shows the classification labels with circles. On
the left-hand side is the cost function over the number of samples for a given µ. Performance
varies depending on µ.
In Figure 3.2 we note that with µ = 0.001 the cost decreases very slowly in comparison to the
bottom graphs with µ = 0.1, resulting in slow learning. For a large µ, here 0.1, we
note that the cost curve is very jagged. This is due to what's called thrashing. As a figurative argument,
with a large step size one could imagine that if the cost function has steep walls, a large
µ would result in bouncing off the walls and at times increasing the cost. In some cases
you could get diverging results. In contrast, a µ that is too small results in very slow
learning, requiring much more time, computation, and possibly more samples. There is an optimal
step size that we should be searching for; in this case it is µ = 0.01. Of course, with
the tuning of multiple parameters, this can turn into a multidimensional search for the
best combination of tuning parameters. For the most part, people in the field test out many
different parameter values on training data to find the best combination.

Figure 3.2: (Top) With µ = 0.001 the step size decreases the cost function but
takes a long time to get there. (Middle) With µ = 0.01, the optimal choice, the
cost function immediately drops, and it classifies quite well. (Bottom) Here µ = 0.1 and
the cost function starts to thrash; misclassification occurs with an increasing cost function.
4. Convolutional Neural Net

4.1 Weaving Nets


Due to the complexity of classifying several types of fish, and given the quality of each
photo, we came to the conclusion that we needed to step into the next packet, Convolutional
Neural Networks. The one clear advantage of convolutional neural networks is that they can
find the right feature vectors and corresponding weights that will map our image to the
correct labeling.

4.2 Vertical Horizontal Line Test

Figure 4.1: Toy images of a vertical bar (left) and of a horizontal bar (right) used to run initial
tests on the convolutional network. These allowed us to iterate our algorithm at a fast rate
while still presenting the CNN with a small classification challenge.

For the first convolutional neural network, we wanted to do a few sanity checks by
starting with simple, small images of either a vertical or horizontal line, as seen in Figure
4.1. Just like in the Gaussian distribution example, we wanted to create an experiment we
could easily and quickly control. While seemingly trivial, these images allowed us to iterate quickly
and fix issues in the algorithm in a relatively low-dimensional problem. Then, we could
move to the higher-dimensional images with some level of confidence that the underlying
algorithms were functioning properly.
This simple test case enabled us to test the inner workings of the convolutional network
at a fast pace. Since each image was only 16x16, we could quickly execute training runs,
adjust parameters, and plot the cost function. We used this in part to gain some intuition
about different cost functions and initial conditions. We played around with softmax and
cross-entropy and then reasoned about the results by reading papers and articles.
One of the more spectacular occurrences was that the convolutional neural network still
classified correctly with a bug inside. On the backward pass between the neural network
and convolutional network, the indexing prevented a number of sensitivities from passing
upstream. Despite this, the convolutional network still trained the remaining weights and
was still able to converge about 70% of the time. This bug was only found when the code
was optimized to allow the move to the larger images.
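For completeness, here is a minimal Matlab sketch of how such 16x16 toy images can be generated; the random placement of the bar and the one-hot label encoding are assumptions for illustration.

% Minimal sketch of the 16x16 toy images: a single vertical or horizontal
% bar placed at a random column/row, with one-hot labels for the two classes.
imgV = zeros(16, 16);  imgV(:, randi(16)) = 1;   % vertical bar image
imgH = zeros(16, 16);  imgH(randi(16), :) = 1;   % horizontal bar image
gammaV = [1; 0];  gammaH = [0; 1];               % labels: vertical vs. horizontal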

4.3 Icons Recognition Test


In the second type of architecture, used for the second convolutional neural network,
we started off with a simple test to make sure our convolutional neural network (CNN) could
work well in a 2-class classification (fish or no fish) and a 6-class classification (types of
fish). We tested the network with 6 kinds of 32x32x3 icon images: Google, Facebook,
Instagram, LinkedIn, YouTube, and Twitter, shown in Figure 4.2.
First, we tested the network with Google's and Facebook's icons to see its performance
in the 2-class classification case. Since the background color of the icons might make the
classification too easy, we converted the RGB icon images to gray-scale. This transformation
was also applied in the later 6-icon classification test. To get the dataset, we
created 120 Google icons and filled them with noise, as well as 120 Facebook icons with
noise (Figure 4.3). The noise is Gaussian with zero mean and variance equal to 0.35.
With the total of 240 data samples, we trained the network with 3/4 of the data, i.e., 180
samples, and then tested it with the remaining 60 samples. The training method we used is
k-fold cross training with 4 folds, and the preprocessing we did is centering the input by
subtracting 128 from the image values. Moreover, the parameters used
in the convolutional network (CN) and neural network (NN) are shown below:

• Number of layers in CN: 3
• Number of depths at each layer (including input layer): {1, 9, 12, 13}
• Number of layers in NN: 4
• Number of nodes in each layer: {208, 148, 118, 2}
• Step size µ: 5 × 10^{-4}
• Regularization coefficient in CN, ρ_CN: 1
• Regularization coefficient in NN, ρ_NN: 1

The result of this test is 100% correct classification, which shows the network is able to use a small
amount of data to distinguish the Google icon from the Facebook icon. Here is an interesting
observation: in the Neural Network handout, the bias coefficients, θ, are initialized with a
Gaussian distribution with zero mean and variance one, and the combination weights are
initialized with a Gaussian distribution with zero mean and customized variance 1/√n_l (which
is a typo and should be 1/n_l), where n_l is the number of depths or nodes in layer l. However,
we found that when the variance of the initialization of the bias coefficients is equal to 1, the
output of the CNN saturates at a constant, but if the variance is 0.1, the performance
improves considerably. In addition, in the initialization of the combination weights, we
found the performance with variance equal to 1/n_l² to be better than with 1/√n_l or 1/n_l. This means that,
in our CNN, the initial values of the bias coefficients and the combination weights need to be
small.
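A minimal Matlab sketch of the initialization variants compared above is given below; the layer sizes are taken from the parameter list for this test, and only the variance scaling differs between the three weight options.

% Minimal sketch of the compared initializations for one fully connected
% layer with nl inputs and n_next outputs (sizes from the list above).
nl     = 148;  n_next = 118;                % example layer sizes
theta0 = sqrt(0.1)    * randn(n_next, 1);   % bias: zero mean, variance 0.1
W_sqrt = (1/nl)^(1/4) * randn(n_next, nl);  % weight variance 1/sqrt(n_l)
W_lin  = sqrt(1/nl)   * randn(n_next, nl);  % weight variance 1/n_l
W_sq   = (1/nl)       * randn(n_next, nl);  % weight variance 1/n_l^2 (worked best here)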

Figure 4.2: 6 kinds of logos

Figure 4.3: Noised Google and Facebook icons

To see how the CNN works, we reduced the variance of the Gaussian noise to 0.05
and generated the noised images. Then we extracted the feature maps before and after the
rectifier activation function. Take the Facebook icon and the feature maps of the first two
layers, for example (Figures 4.4 & 4.5). From the feature maps in the first layer, we can see
that the CNN separates the useful information from the noise appropriately.

Figure 4.4: Noised Facebook icon with the variance = 0.05



Figure 4.5: Feature maps before (upper) and after (lower) the rectifier activation
function in the first layer

After the success of the 2-class classification, we moved on to the classification of 6 icons.
As before, our input is the noised gray-scale icon images, shown in Figure 4.6, which are
transformed from the 32x32 RGB icon images. The total number of data samples is 480, with
80 for each kind of icon. With the same set-up of the dataset and the training algorithm, the
parameters used in the convolutional network (CN) and neural network (NN) are shown
below:

• Variance of the Gaussian noise: 0.3
• Number of folds in the k-fold cross-training algorithm: 4
• Number of layers in CN: 3
• Number of depths at each layer (including input layer): {1, 7, 12, 14}
• Number of layers in NN: 4
• Number of nodes in each layer: {224, 160, 128, 6}
• Step size µ: 1 × 10^{-4}
• Regularization coefficient in CN, ρ_CN: 1
• Regularization coefficient in NN, ρ_NN: 1

Figure 4.6: Noised icons with the variance = 0.3

At the beginning of the 6-icon test, we used the same variances for the initialization of
the bias coefficients and the combination weights as before, i.e., 0.1 and 1/n_l². However, it turned out
that the probability of each icon predicted by the network was equally 1/6, which means the network
just made a random guess. To gain insight into tuning the variances, we first looked into
the computation of the outputs in the last layer of the neural network. We found that the values
of the bias coefficients were about ten times larger than the values of the combination weights.
Therefore, we decided to decrease the variance of the bias coefficients to 0.01 and increase
the variance of the combination weights by making the exponent in the 1/n_l^(·) relation between the
variance and the node/depth number smaller. The result turned out to be that the network always predicted
an icon with probability one. From these two classification results, we noticed that
they are two opposite extremes, and thus there might be some values of the variances
between the two extreme points such that the network will learn well. Finally, we used a
variance of the bias coefficients equal to 0.01 and a variance of the combination weights
equal to 1/n_l^1.65. It is quite amazing because we only needed 360 training data samples and 4 folds to get
a good network to classify 6 icons. This setting gave us successful prediction with 87.5%
correct classification. From this test, we gained the intuition that a good initialization will enable the
network to avoid falling into an undesired local minimum and arrive near the global minimum
point.
Here, we present the feature maps (Figure 4.7), before and after the rectifier activation
function, of the Twitter icon image in the first two layers. They show that even though the data is
heavily noised, the network can still classify it as the Twitter icon. Furthermore,
it successfully extracts the useful information. For example, the 8th image of the feature
maps before the activation function in the second layer has a bird-shaped outline in the middle of
the image.

Figure 4.7: The network can extract useful information from the noised Twitter icon
(top). For example, the 8th image of the feature maps before the activation
function in the second layer has a bird-shaped outline in the middle of the image.

4.4 Softmax Vs Cross-Entropy


In the Theory Overview section we mentioned stochastic gradient descent
for minimizing the cost function, ultimately leading to learning a set of weights that maps the
feature vectors to the correct label. Stochastic-gradient descent is the basic way to train a network.
Softmax and cross-entropy use the idea of stochastic-gradient descent to train for specific
purposes.
One of the advantages of the softmax algorithm is that the output-layer values
always sum to one. This is useful if you have multiple categories, as in our case. Of
course, we probably care more about performance than about the convenience of interpreting
the output of softmax, but it is still nice to have a proper probability distribution over the categories.
The input is classified according to the node with the largest value in the output layer.
$$y_n(q) \;\triangleq\; e^{z_n(q)} \left( \sum_{k=1}^{Q} e^{z_n(k)} \right)^{-1} \qquad (4.1)$$

$$[J]_{jk} \;\triangleq\; \left( \delta_{jk} - y(j)\,y(k) \right) \qquad (4.2)$$

$$\delta_{L,n} = 2J\,(y_{L,n} - \gamma_n) \qquad (4.3)$$

$$\delta_{l,m} = f'(z_{l,m}) \odot \left( W_{l,n-1}^{\mathsf{T}}\,\delta_{l+1,m} \right) \qquad (4.4)$$

$$W_{l,n} = (1 - 2\mu\rho)\,W_{l,n-1} - \mu\,\delta_{l+1,n}\,y_{l,n}^{\mathsf{T}} \qquad (4.5)$$

$$\theta_{l,n} = \theta_{l,n-1} - \mu\,\delta_{l+1,n} \qquad (4.6)$$

Cross-entropy training has its own distinct benefits. Since its cost function has a logarithmic
term as the non-regularized term, it cancels out the plateau that is present in the softmax
last-layer function, therefore speeding up the learning [7]. In other words, since softmax training
uses derivatives of the activation functions, you start to see saturation, which leads to poor
learning. This is no longer the case for the logarithmic cross-entropy cost function. The only
requirement is that the last layer needs to contain sigmoid activation functions to properly
implement cross-entropy. As a side note, you don't want to use sigmoids throughout
the entire network because that would produce a vanishing gradient.
$$J_{\text{emp}}(W,\theta) \;\triangleq\; \sum_{l=1}^{L-1} \rho\,\|W_l\|_F^2 \;-\; \frac{1}{N}\sum_{n=0}^{N-1}\sum_{q=1}^{Q} \ln\!\left( y_n(q)^{\gamma_n(q)}\,\bigl(1 - y_n(q)\bigr)^{1-\gamma_n(q)} \right) \qquad (4.7)$$

$$\delta_{L,n} = y_{L,n} - \gamma_n \qquad (4.8)$$
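As a minimal Matlab sketch of how these last-layer quantities are evaluated for one sample, the snippet below computes a softmax output as in (4.1) and the cross-entropy sensitivity of (4.8); the pre-activation vector z and the one-hot label γ are made-up values.

% Minimal sketch of a softmax output layer (Eq. 4.1) and the cross-entropy
% last-layer sensitivity (Eq. 4.8) for a single sample.
z     = [2.0; -1.0; 0.5];            % pre-activation of the last layer (illustrative)
gamma = [1; 0; 0];                   % true one-hot label (illustrative)

y      = exp(z) / sum(exp(z));       % softmax: entries are positive and sum to 1
deltaL = y - gamma;                  % sensitivity propagated back from the last layer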

Figure 4.8 shows the three different ways to train networks, applied to the
over-simplified Gaussian distribution example above. With such a trivial
example it is hard to see the finer nuances, since in this case the curves look very similar. If you
look closely at the bottom graph of Figure 4.8, at the very beginning cross-entropy
descends faster than stochastic descent despite starting at a higher value. If you were
to measure gradients, you would indeed see that cross-entropy drops the fastest in the
beginning, meaning that it is learning from the data at a faster rate. We see this much more
pronounced when looking at higher-dimensional problems. The other thing to note is that
we don't expect the cost functions to converge to the same place; this is simply because the
value of the cost function depends on how you define it. The cost functions for all three of
these methods are different, especially if you choose coefficients to adjust the weight of
certain terms.

Figure 4.8: (Left) Neural network ensemble average with smaller µ = 0.005. (Right) Neural
network ensemble average with slightly bigger µ = 0.05.

Figure 4.9: (Left) Convolutional neural network J_emp at 240 iterations. (Right) CNN learning
slope for 240 iterations.

In the final figure, Figure 4.10, you can really see the steepness of the cross-entropy learning
slope. At the very beginning it is extremely high, after which it drops. This implies that the
learning rate is much higher in the beginning, whether that be good or bad learning. As a
side note, just because it converges somewhere doesn't mean it will converge
to a good classifier. The idea of cross-entropy is to learn at a faster rate, allowing for less
computation time and processing of data.

Figure 4.10: (Left) Convolutional neural network J_emp at 60 iterations. (Right) CNN learning
slope for 60 iterations.

4.5 Numerical Checking


One quick check we can do to see that these algorithms work is a numerical check of
the gradient. Numerical checking was introduced in the Neural Network handout, but it can
also be implemented for convolutional neural networks. The check looks at how the outputs
change when certain weights are perturbed. For example, in the CNN we can perturb an
element of $W_c$, $W_l$, $\Theta_l$, or $\theta_l$. For instance, let's add a perturbation to an
element of $W_c$.

Figure 4.11: Perturbing weight $w_j^{(d,2)}$

If we perturb a particular element of $w_j^{(d,2)}$, say $(w_j^{(d,2)})(\alpha,\beta)$, it is perturbed
to $(w_j^{(d,2)})(\alpha,\beta) \pm \varepsilon$. We apply both the positive perturbation and the negative
perturbation to the $(\alpha,\beta)$ term of $w_j^{(d,2)}$, and the results are $(w_j^{(d,2)})_{+\varepsilon}$
and $(w_j^{(d,2)})_{-\varepsilon}$, respectively. For the feature vector and label, $(h,\gamma)$, coming in,
the input will propagate forward and give two different outputs, $y_{+\varepsilon}$ and $y_{-\varepsilon}$.

$$\frac{\partial C(W,\theta)}{\partial w_{ij}^{(l)}} \;\approx\; \frac{1}{2\varepsilon}\left[ C\bigl(w_{ij}^{(l)} + \varepsilon\bigr) - C\bigl(w_{ij}^{(l)} - \varepsilon\bigr) \right] \qquad (4.9)$$
Calculating the right-hand side of equation (4.9) (equation 4.5 in the handout) gives us
$\frac{1}{2\varepsilon}\left[\|\gamma - y_{+\varepsilon}\|^2 - \|\gamma - y_{-\varepsilon}\|^2\right]$ (central difference);
comparing this with the $(\alpha,\beta)$ term of the value calculated analytically by equation (4.10)
allows us to check the gradient.

$$\frac{\partial\,\|\gamma - y\|^2}{\partial W_c} = \bigl(\mathbb{1}_{D_c}^{\mathsf{T}} \otimes H_{n,c-1}\bigr)\,\Delta_{n,c} \qquad (4.10)$$

Figure 4.12: Perturbation in the weights perturbing the output.

Figure 4.12 above shows figuratively what is meant by perturbing a certain weight and
getting a resulting effect in y. This makes sense: since everything is interconnected, we
expect to see some change in the output unless there is an entire layer of zero weights or
something highly unusual that would zero out the effect of the perturbation on the output.
As we see in Figure 4.13, the numerical checking values are quite close, which shows that
our algorithm and setup work.
One smaller issue we ran into when calculating the numerical gradient was the lower limit
of the central difference. The smallest ε value we can use is 1 × 10^{-5}; as we approach
this limit, the gradients start to give us larger and larger errors for the perturbation. Once we
fixed that, we were able to use this same method for convolutional neural networks. The
maximum difference in the numerical check turned out to be on the order of 1 × 10^{-9}.
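A minimal Matlab sketch of this central-difference check on a single scalar weight is shown below; the tiny one-weight sigmoid model is an illustrative stand-in, not our CNN, but the checking logic is the same as in equation (4.9).

% Minimal sketch of the central-difference gradient check (Eq. 4.9) on a
% tiny model y = sigma(w*h - theta) with cost C = ||gamma - y||^2.
h = 0.7;  gamma = 1;  w = 0.3;  theta = 0.1;  epsilon = 1e-5;
f = @(ww) (gamma - 1/(1 + exp(-(ww*h - theta))))^2;    % cost as a function of w

g_num = (f(w + epsilon) - f(w - epsilon)) / (2*epsilon);  % numerical gradient
y     = 1/(1 + exp(-(w*h - theta)));
g_ana = -2*(gamma - y) * y*(1 - y) * h;                   % analytical gradient
fprintf('difference: %g\n', abs(g_num - g_ana));          % should be tiny (~1e-9)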

Figure 4.13: (Top) Numerical gradient checking. (Bottom) The specific structure we are
checking.
5. In Practice

5.1 Fish & Datasets


Just as important as building the convolutional neural network, choosing the right type
of dataset plays a crucial role in performance and training. Unfortunately, implementation
with actual live data is harder than expected because of all the variability in the limited
data. If we wanted to do good classification in the limited time, we had to find a methodology
to reduce the complexity without taking away too much performance.

In the first set of pictures, in Figure 5.1, we see the types of data we need to go through to
classify whether there is a fish and, if there is, what kind of fish it is. As you may have noticed,
image quality and variability can change drastically. The picture on top was taken
at a certain time of day with a specific lighting that acts almost like a filter. In addition,
the fish is only partially shown; in some cases humans can't even make out what
kind of fish is in the image. In the lower image we can probably make out the
fish and what kind it is. However, there are small distortions that make it difficult to classify
even in the most ideal cases. There will be times when water or fog distorts the image. Even
more so, the picture sizes and camera types may differ: some images are longer or wider than
others, and other images are taken with a fisheye lens. The largest issue we encountered was
dealing with image size; dealing with differing image sizes is much harder than one would
think. With all these factors to take into account, we decided to try to simplify the problem.
Our first priority is to cut as much noise as we can from the image. As you can see in
the following images, the fish only makes up a small portion of the image;
most of the picture is of the boat, water, or people. In order to increase our chances of
classifying the image correctly, it makes sense to first find the fish and then crop away the noise
to reduce the uncertainty. This is what our first convolutional neural network does. This creates
a smaller problem within itself. Fortunately, we found datasets on Imagenet of fish, people,
and poop decks that we could use for this task. This intuitive approach is based on the fact
that in the first convolutional neural network, which classifies whether or not there is a fish,
you just have to know what a fish looks like. That means Imagenet's library can be used to
increase the training data with much cleaner data.

Figure 5.1: (Top) The picture here is one of the few types of pictures with quality that makes
it hard to classify. (Bottom) This is representative of the decent quality of some pictures.

The following Figure 5.2 shows the types of data on Imagenet. The image on the far left
depicts how the images in the library appear. If time were not a factor we could very well
have used RGB images (images with color that have three separate intensity layers of
information in the red, green, and blue additive primaries). Feeding in one RGB image
is computationally costly; it is almost like inputting three separate images. Since the first
convolutional neural network only classifies whether there is a fish or not, we came to the
conclusion that simplifying the problem would help us meet our project deadline
without giving up classification performance. After a few discussions, we concluded that
reducing the images to gray-scale was good enough.

The other big question was how to deal with images of differing sizes. We initially
decided to fill the background with uniform noise while randomizing where the image sits in
the plane. For example, on an R² space we filled the space with noise and randomly placed
the center of the image somewhere in this space. The result is the center image in Figure 5.2.
We thought randomization of location and uniform noise would avoid classification
based on the dimension edges. This may have been one of the contributing factors to the
non-converging cost functions: despite the randomization, it may have created unnecessary
calculations or unrelated correlations.


The image all the way on the right shows what we decided to do. We cropped out a set of
ideal images with ideal background noise (humans, sea, boats, etc.) and ideal size. We
still kept the gray-scale, again to run faster tests; the large image size would require months
of training, which is beyond the allotted time for this project. Fortunately, we ended up
converging on this method.
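A minimal Matlab sketch of this preprocessing step is shown below; it assumes the Image Processing Toolbox (rgb2gray, imresize), and the file name is a hypothetical placeholder for one of the manually cropped photos.

% Minimal sketch of the preprocessing: cropped color photo -> centered
% 280x280 gray-scale input. File name is a placeholder, not from our dataset.
I  = imread('fish_crop.jpg');        % a manually cropped image (hypothetical name)
G  = rgb2gray(I);                    % drop color: RGB -> single gray channel
G  = imresize(G, [280 280]);         % scale to a uniform 280x280 input
Gc = double(G) - 128;                % center pixel values before training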

Figure 5.2: (Left) Original types of images pulled off of Imagenet [4]. (Center) A technique to deal
with the dimensionality issue by placing the image at a random location in a noise background.
(Right) A second method of dealing with dimensionality by cropping ideal images of fish.

5.1.1 Curse of Dimensionality


As we hinted earlier in this report, image size gave us the most trouble. Once we
built the convolutional neural networks, we tested them on toy examples and they
worked great: they converged quickly and classified the 16x16 and 32x32 images
correctly 100% of the time. Even when one of our codes had a massive bug at the
beginning, the classifier still classified things correctly. However, once we tried to run
the CNN on our actual dataset of 512x512 images, things started to fall apart.

Larger images yield a much larger number of feature vectors at each subsequent
layer of the network, which requires both additional memory and processing power.
Furthermore, larger images also increase the complexity of the problem and may require a
larger number of depths/nodes to properly fit, increasing the VC dimension of the classifier.
As we know from the discussion of VC dimension in Professor Sayed's lectures and notes,
a classifier with a larger VC dimension requires a larger number of training samples to reach
the same accuracy. The classifier would need more and more training points to learn how to
correctly map out the space.
Since we knew dimensionality was going to be a problem from the beginning, we
decided to transform the images from varying sizes of 400x500x3 to 700x1200x3 into square
512x512x1 gray-scale images. In this way, we could have multiple convolutional layers
with constant 2x2 pooling sizes while still reducing the dimension of the problem. Due
to the size of the images, we could only implement a convolutional network with depths of
{20, 40, 60} and neural nodes of {1000, 1000} before running into memory or processing
issues due to the sheer size of the feature vectors (512x512 = 262144 features at
the input, 256x256x20 = 1310720 on the second layer, 128x128x40 = 655360 on the third, etc.).
The computational challenges we faced are all detailed later, in Section 5.3. In addition, the
academic papers and resources we found elsewhere mentioned much larger architectures for a
smaller problem of 224x224 images [8]. Due to this, we attributed a portion of the error to the
CNN under-fitting, as we had far fewer depths for a larger problem.
After running the CNN with different parameters and still getting a poor classifier, we
decided to reduce the dimensions again by reducing the image size. This time, we manually
cropped the photos to be as close to a 280x280 image as possible while still capturing
the entire fish. Then, we used Matlab to scale the images slightly so that they were all
uniform 280x280 images for the CNN to classify.

By reducing the number of initial features, we could now increase the dimensionality of
our classifier to better fit the problem. While we kept a wary eye out for signs of over-fitting,
we were able to increase our depths to {48, 64, 72} on the convolutional side while reducing
the number of neural nodes to {800, 800}.

Layer                    | Depths/Nodes | Number of Features | Number of Weights
CNN Input                | 1            | 512x512            | -
CNN Layer 1              | 20           | 256x256x20         | 9
CNN Layer 2              | 40           | 128x128x40         | 9
CNN Layer 3 (NN Input)   | 60           | 64x64x60           | 9
NN Layer 1               | 1000         | 1000               | 1000x64x64x60
NN Layer 2               | 1000         | 1000               | 1000x1000
NN Output                | 2            | 2                  | 2x1000

Table 5.1: 512x512 image CNN dimensions. The largest entries are easily seen as the
number of features at CNN Layer 1 and the number of weights on NN Layer 1.

Layer                    | Depths/Nodes | Number of Features | Number of Weights
CNN Input                | 1            | 280x280            | -
CNN Layer 1              | 48           | 140x140x48         | 9
CNN Layer 2              | 64           | 70x70x64           | 9
CNN Layer 3 (NN Input)   | 72           | 35x35x72           | 9
NN Layer 1               | 800          | 800                | 800x35x35x72
NN Layer 2               | 800          | 800                | 800x800
NN Output                | 2            | 2                  | 2x800

Table 5.2: 280x280 image CNN dimensions. Again, the largest sizes are in CNN Layer
1 and NN Layer 1. Due to the reduced image sizes, though, the depths were increased by
a large margin. This allowed the number of neural nodes to shrink and overall gave better
performance.
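As a quick sanity check on Table 5.2, the short Matlab sketch below reproduces the per-layer feature counts, assuming a 280x280 input, 2x2 pooling that halves each spatial dimension at every layer, and the depths {48, 64, 72}.

% Quick sanity computation of the feature counts in Table 5.2.
side   = 280;
depths = [48 64 72];
for c = 1:numel(depths)
    side = side / 2;                                    % effect of 2x2 pooling
    fprintf('CNN layer %d: %dx%dx%d = %d features\n', ...
            c, side, side, depths(c), side*side*depths(c));
end
% Prints 140x140x48, 70x70x64, and 35x35x72, matching the table.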

5.2 Weaving Nets


When we introduced the different tests, the Vertical Horizontal Line Test and the Icons
Recognition Test, we also purposefully mentioned two types of architectures. By architectures,
we are referring not only to the structure of the network but also to how the algorithm was coded.
A big part of creating two separate architectures for the convolutional neural networks was to
try different techniques. The other part was to verify each part of the algorithm with some sort
of check, seeing that we were developing these convolutional neural networks from scratch.
It is useful in this way for each algorithm to have two separate approaches.

Here we will talk about the whole algorithm and the different approaches. If you want to
read more about the exact differences between our algorithms and the ones in the notes,
please refer to the Algorithm Adjustments section below.

5.3 Challenges
The largest challenge in this classification problem is the size of the images given the
time constraint. As we mentioned earlier, processing all this information to get
really accurate results would require months of training given 1240x700 pixel images.
This one issue leads into all of our other problems. On small toy examples, both of our
architectures work great.

With the complexity and dimensionality increasing, there is also an issue of parameter
sensitivity. We noticed with smaller images that there was a large range of parameters
and methods that converged to a good classifier. However, with much larger images and
dimensions we noticed that some of the parameters would converge to weak classifiers.
Unfortunately, we ran out of time to try the more effective bagging and boosting methods.

Another nuanced challenge was deciding which source to listen to. There are many free
online materials with different points of view and techniques. The class lecture material
and notes formed a good foundation to build on, but there was definitely a gap that needed
to be bridged in terms of intuition and knowledge. In many ways this was challenging but a
good learning exercise, which is probably how Dr. Sayed set the class up.

5.3.1 Computational Limits


One of the largest challenges was that our application required more processing power
and memory than was readily available. For reference, the computer used to execute most
of the code was equipped with a 6-core i7-5820K @ 3.30 GHz, 16 GB of RAM, and an
Nvidia Quadro K4200 GPU. Due to the large number of weights/depths/features at a particular
layer, Matlab often returned a matrix-dimensions-exceeded error, as it required large blocks
of RAM for intermediate matrix calculations. This memory issue even caused
Matlab to crash on a number of occasions. Therefore, we not only had
to build the algorithm, but we also had to continuously optimize it to produce any results.
At the beginning, a single 512x512 image took up to 5 minutes to run through forward
propagation, backward propagation, and the weight updates. We ended up cutting this down
to around 38 seconds, even with the reduced dimensions mentioned in Table 5.2. Many of
the optimizations are mentioned in Section 5.5.


We did look into parallel processing on the multiple CPU cores and into executing the
calculations on the GPU in Matlab, but both proved infeasible for different reasons. For
parallel processing, Matlab requires that parfor loops be independent of each other so that
each iteration may be executed on a separate thread. Matlab also needs to copy a separate
instance of each variable required for one loop execution. Since each layer is dependent
on the data from the previous layer, this left us only with the option of parallelizing over the
samples of a given batch. The latter constraint of duplicating the variables, however, became
an issue as the algorithm already struggled with memory; parallel processing using
parfor loops on n processors would require n times more memory. Thus, we were
unable to take advantage of parallel processing with our hardware.

The alternative solution was to use a workstation GPU to execute the computations and
speed up the processing. Matlab supplies a relatively easy-to-use toolbox that allows the user
to push variables to the discrete GPU memory and run all computations on those variables
directly on the GPU. The only limitations are the specifications of the GPU hardware and the
overhead of pushing variables between the computer memory and the GPU memory.
Unfortunately, the large intermediate matrix multiplications ended up using nearly 12 GB
of memory, but the GPU we were working with had only 3 GB. There was
some overhead in moving the data to and from the GPU, but memory served as the main
bottleneck.
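For illustration, here is a minimal Matlab sketch of the GPU path described above, assuming the Parallel Computing Toolbox; with matrices the size of our intermediate products this pattern exceeded the 3 GB of GPU memory, which is why the approach was abandoned.

% Minimal sketch of pushing a computation to the GPU with gpuArray/gather.
A = rand(5000, 5000, 'single');
B = rand(5000, 5000, 'single');

Ag = gpuArray(A);            % push operands to GPU memory
Bg = gpuArray(B);
Cg = Ag * Bg;                % the multiply runs on the GPU
C  = gather(Cg);             % copy the result back to host memory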

5.4 Compromises
To start, our goal was, and in many ways still is, ambitious. Due to time restrictions,
we had to compromise on many fronts. One such compromise, mentioned in
Section 5.1, was the dimension of our training data. Not only did we move from color to
gray-scale, but we also reduced the image size from 512x512 to 280x280. Even though
reducing the dimension of the image prevented us from classifying larger images, we were
able to increase our layer depths and the overall performance of the network. It also helped ease
the memory issues and reduce the number of computations required for each sample.

Another compromise was that we had to cap the number of samples that any given set
of parameters was trained on. The performance of the classifier and its rate of convergence
were highly dependent on the parameters and initial conditions. Since images at one point
took up to 5 minutes to run, we simply did not have enough time to let the networks run
for very long before stopping them. Thus, we were forced to check the performance of each
classifier given a training size of 400-800 samples before starting with a new parameter set
or initial condition. We were almost always at odds between letting a particular weight set
train more or simply resetting the training.
We would also have liked to do cross-validation tests to search for better parameters,
and to use bagging and boosting. For such a complex problem we are
very interested in seeing how much improvement we could get by using these techniques.
There were also techniques for handling images of different sizes; they required some
background knowledge in support vector machines and the bag-of-words approach [9], but again,
given the time, we had to limit ourselves. We will still continue the project after the finish
date on our own accord; after all, it is for a good cause. The limiting factor in all of these
cases was time. The same could be said in all fields about learning: the horizon of knowledge
can never be reached. If we keep our learning steps constant, we will always learn [10].

5.5 Algorithm Adjustments


Most of the changes to the algorithm occurred due to exceeding the memory of
the computer and Matlab's dimension limits. There were a few cases where we had to adjust
the code to have it run faster. Other changes came from small typos, which you can refer to
towards the end of the report in the Typos section.

5.5.1 Architecture 1
The 1st CNN was implemented directly from the algorithm in Figure 5.3.

Figure 5.3: Stochastic Gradient Descent Algorithm for Convolutional Neural Networks

In this CNN, an alternating pattern of rectified linear units and scaled tanh functions
was used to allow for fast training but also to prevent over-saturation. The combination
of cascading the two different types of activations proved experimentally more stable than
either alone. The activation functions for the neural network were set to rectifiers for the first

two layers and a sigmoid for the last layer for cross-entropy training. At first, the network
used softmax training on the last layer, but it was switched to cross-entropy training after
experimental results showed a slightly faster convergence rate. Other important nuances
of this CNN were 3x3 convolutional masks and 2x2 pools at all layers. The network was
initialized with zero mean and a variance of 0.01, while the training weights were zero mean
and normalized by the square root of the number of depths/nodes, respectively [8].
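A small sketch of our reading of this initialization is given below; the layer sizes are placeholders and not the actual dimensions of the network.

```matlab
% Sketch of the initialization described above (sizes are hypothetical):
% masks are zero-mean Gaussian with variance 0.01, while the neural-network
% weights are zero mean and scaled by 1/sqrt(number of input nodes) [8].
numDepths = 48;  numIn = 256;  numOut = 64;    % placeholder dimensions
masks = sqrt(0.01) * randn(3, 3, numDepths);   % 3x3 convolutional masks
W     = randn(numOut, numIn) / sqrt(numIn);    % fully connected weights
```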
There were a few changes, though, to reduce memory/processing requirements or for
easier implementation. The first such adjustment was to the partition function as denoted by
equation 5.1.

\[
\mathrm{col}\{h_{n,p}\} = \mathrm{partition}(h_n) \iff \mathrm{col}\{h_{n,p}\} = V h_n, \qquad V \in \mathbb{R}^{(P_{c+1} S_c) \times P_c'}
\qquad (5.1)
\]

The main practical issue with this representation for images is the size of the partitioning
transformation matrix, V. Even with reduced 280x280 images, Table 5.2 still shows that
V will be of size (70x70x64x9) x (70x70x64) = (8467200 x 940800). Even just storing this
matrix, let alone using it in a matrix multiplication, would require an extraordinary amount of
memory. The condensed form of V given by equation 5.2 is better, but still requires a large
amount of memory.

\[
(I_{P_c} \otimes \mathbb{1}_S^T)\, V \in \mathbb{R}^{P_c \times P_c'}
\qquad (5.2)
\]

There are a large number of zeros in these two interpretations that end up taking unneeded
space. Assuming a constant partition size, we get around this issue by taking advantage
of how each element of the image is indexed. For example, the left hand argument of
equation 5.3 shows a 2x2 image padded with zeros on the border. Padding ensures that a
3x3 convolutional mask will generate as many partitions as there are pixels in the image.
Also, the index() function is simply a placeholder function that returns the index of each
element. To keep this algorithm general, we now stack all columns of the image into a
feature vector as shown in equation 5.4. It is easy to see then that the indexes on the right hand side of the
equations correspond to this stacked representation.

\[
\mathrm{index}\!\left(\begin{bmatrix} 0 & 0 & 0 & 0\\ 0 & x_{22} & x_{23} & 0\\ 0 & x_{32} & x_{33} & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}\right) =
\begin{bmatrix} 1 & 5 & 9 & 13\\ 2 & 6 & 10 & 14\\ 3 & 7 & 11 & 15\\ 4 & 8 & 12 & 16 \end{bmatrix}
\qquad (5.3)
\]

\[
\mathrm{index}\!\left(\begin{bmatrix} 0\\ 0\\ \vdots\\ 0\\ x_{22}\\ x_{32}\\ 0\\ \vdots\\ x_{23}\\ x_{33}\\ 0\\ \vdots\\ 0 \end{bmatrix}\right) =
\begin{bmatrix} 1\\ 2\\ \vdots\\ 5\\ 6\\ 7\\ 8\\ \vdots\\ 10\\ 11\\ 12\\ \vdots\\ 16 \end{bmatrix}
\qquad (5.4)
\]

Now, each element in a partition can instead be represented by its respective index.
For example, the first partition is a 3x3 square in the upper left hand corner, represented by
the argument on the left hand side of equation 5.5. The right hand side shows the appropriate
indexes.

\[
\mathrm{index}\!\left(\begin{bmatrix} 0 & 0 & 0\\ 0 & x_{22} & x_{23}\\ 0 & x_{32} & x_{33} \end{bmatrix}\right) =
\begin{bmatrix} 1 & 5 & 9\\ 2 & 6 & 10\\ 3 & 7 & 11 \end{bmatrix}
\qquad (5.5)
\]

\[
\mathrm{index}\!\left(\begin{bmatrix} 0\\ 0\\ 0\\ 0\\ x_{22}\\ x_{32}\\ 0\\ x_{23}\\ x_{33} \end{bmatrix}\right) =
\begin{bmatrix} 1\\ 2\\ 3\\ 5\\ 6\\ 7\\ 9\\ 10\\ 11 \end{bmatrix}
\qquad (5.6)
\]

All the columns of the partition are then stacked to form a feature vector while the
indexes are stacked to form a partition vector, as shown by equation 5.6. The H_n^{(d,c+1)}
matrix can then be constructed by placing each of the feature vectors together. Likewise,
the partition vectors can be placed side by side to create a partition matrix, denoted V_{c+1}^{in},
as shown in equation 5.7.

\[
H_n^{(d,c+1)} = \begin{bmatrix}
0 & 0 & 0 & x_{22}\\
0 & x_{22} & 0 & x_{32}\\
0 & x_{32} & 0 & 0\\
0 & 0 & x_{22} & x_{23}\\
x_{22} & x_{23} & x_{32} & x_{33}\\
x_{32} & x_{33} & 0 & 0\\
0 & 0 & x_{23} & 0\\
x_{23} & 0 & x_{33} & 0\\
x_{33} & 0 & 0 & 0
\end{bmatrix}, \qquad
V_{c+1}^{in} = \begin{bmatrix}
1 & 5 & 2 & 6\\
2 & 6 & 3 & 7\\
3 & 7 & 4 & 8\\
5 & 9 & 6 & 10\\
6 & 10 & 7 & 11\\
7 & 11 & 8 & 12\\
9 & 13 & 10 & 14\\
10 & 14 & 11 & 15\\
11 & 15 & 12 & 16
\end{bmatrix}
\qquad (5.7)
\]
It can then be seen that each element of the new H_n^{(d,c+1)} at each layer can simply be
found by taking the index of t_n^{(d,c)} at that same position, as shown by equation 5.8.

\[
H_n^{(d,c+1)}(i,j) = t_n^{(d,c)}\big(V_{c+1}^{in}(i,j)\big), \qquad i = 1,\dots,S_c;\ j = 1,\dots,P_{c+1}
\qquad (5.8)
\]

In this way, we can construct H_n^{(d,c+1)} using only an (S_c x P_{c+1}) matrix. This reduces
the memory storage by a large factor (in the 280x280 case, by a factor of 940800), as the
mask size is constant and there are only as many elements in V_{c+1}^{in} as are absolutely necessary.

Thus, by keeping a matrix of indexes that correspond to the partition elements from the
original image or feature vector, we can save most of the space. Also, since we are now
using array access instead of matrix multiplication, we save a number of calculations.
In addition, as long as partitioning is done in the same fashion across images, partitions
of any given image of the same size will have the same partition index matrix, V_{c+1}^{in}. Thus,
for our application with one image size of 280x280, we only need to calculate all V_{c+1}^{in} once
and can use them throughout the loops at little computational cost. If there were multiple
sizes, images of the same size could be batched together and we would only need to calculate
as many V_{c+1}^{in} as there are different image sizes.

While this method employs the default indexing for matrices in Matlab, any code which
follows this indexing and has constant partition sizes can use it to its advantage. This is
the main advantage of generalizing the features as a vector instead of a matrix.
The approach can also be extended to non-constant partition sizes by using data structures
that allow a variable number of rows. As long as the partitions are known ahead of time, this
algorithm will reduce the memory and processing requirements.
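A minimal Matlab sketch of this indexing trick is given below. The function and variable names are our own and the partition ordering is illustrative; the point is that the partition index matrix is built once, and the feature matrix is then formed by array access rather than by multiplying with V.

```matlab
function [H, Vin] = partition_by_index(t, imgSize)
% Build 3x3 overlapping partitions of a zero-padded feature vector by
% indexing. t is (imgSize^2 x 1), stacked column-major; H is (9 x imgSize^2)
% and Vin holds the padded-image index of every partition element.
    padSize = imgSize + 2;                         % zero padding on the border
    padded  = zeros(padSize);
    padded(2:end-1, 2:end-1) = reshape(t, imgSize, imgSize);
    idx = reshape(1:padSize^2, padSize, padSize);  % column-major indexes

    Vin = zeros(9, imgSize^2);                     % one 3x3 partition per pixel
    p = 0;
    for j = 1:imgSize                              % partition ordering is
        for i = 1:imgSize                          % illustrative only
            p = p + 1;
            block     = idx(i:i+2, j:j+2);         % indexes of this partition
            Vin(:, p) = block(:);
        end
    end
    H = padded(Vin);                               % array access replaces V*h
end
```

Because Vin depends only on the image size, for our single 280x280 input it can be computed once and reused for every sample and depth.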
The same method for finding the indexes of each partition also applies to the permute
and permute# transformations. In the case of permute, we replace the 3x3
sliding window with a non-overlapping 2x2 square over which we find indexes for the pool
elements. The only difference is that when the pools are separated, there is no extra padding,
as the image dimensions are assumed to be divisible by 2. Thus, the pools and subsequent
indexes effectively divide the number of features per layer by 4. The same concepts
still apply in that we use the indexes to find an appropriate index matrix, V_c^{pool}, that satisfies
equation 5.9.

\[
t_n^{(d,c)}(i,j) = y_n^{(d,c)}\big(V_c^{pool}(i,j)\big)
\qquad (5.9)
\]

For the 2x2 image example laid out in equation 5.3, the solution is trivial, as we only
have one pool of the nonzero elements. Thus, we introduce some 4x4 y_n^{(d,c)} in equation 5.10
to motivate a more developed solution.

\[
\mathrm{index}\big(y_n^{(d,c)}\big) = \mathrm{index}\!\left(\begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14}\\ x_{21} & x_{22} & x_{23} & x_{24}\\ x_{31} & x_{32} & x_{33} & x_{34}\\ x_{41} & x_{42} & x_{43} & x_{44} \end{bmatrix}\right) =
\begin{bmatrix} 1 & 5 & 9 & 13\\ 2 & 6 & 10 & 14\\ 3 & 7 & 11 & 15\\ 4 & 8 & 12 & 16 \end{bmatrix}
\qquad (5.10)
\]

Now, permute takes 2x2 non-overlapping pools and gives equation 5.11. The pool
function is then called on each column vector produced by permute. The transpose is to keep
with the notation that t_n^{(d,c)} should be a column vector of max values. All the same
memory and processing savings are still applicable in this case as well.

\[
\big(t_n^{(d,c)}\big)^T = \mathrm{pool}\!\left(\begin{bmatrix} x_{11} & x_{31} & x_{13} & x_{33}\\ x_{21} & x_{41} & x_{23} & x_{43}\\ x_{12} & x_{32} & x_{14} & x_{34}\\ x_{22} & x_{42} & x_{24} & x_{44} \end{bmatrix}\right), \qquad
V_c^{pool} = \begin{bmatrix} 1 & 3 & 9 & 11\\ 2 & 4 & 10 & 12\\ 5 & 7 & 13 & 15\\ 6 & 8 & 14 & 16 \end{bmatrix}
\qquad (5.11)
\]
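A short sketch of the pooling step done through a precomputed index matrix, in the spirit of equation 5.9, is shown below; the function name and the winner-index bookkeeping are our own additions, not code from the handout.

```matlab
function [t, winnerIdx] = pool_by_index(y, Vpool)
% 2x2 max pooling via a precomputed index matrix. y is (P x 1) for one depth
% and Vpool is (4 x P/4), each column holding the indexes of one pool.
    pools         = y(Vpool);                  % gather pool members by indexing
    [t, rowOfMax] = max(pools, [], 1);         % max over each pool (column)
    t             = t(:);                      % column vector of pooled values
    cols          = 1:size(Vpool, 2);
    winnerIdx     = Vpool(sub2ind(size(Vpool), rowOfMax, cols)).';
    % winnerIdx records which original entry won each pool, which is useful
    % when routing sensitivities back through permute# during training.
end
```

For a quick check, calling pool_by_index with y = (1:16).' and the V_c^{pool} of equation 5.11 returns the column vector (6, 8, 14, 16).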

Then, looking at the permute# transformation in equation 5.12, we can see that it uses
up more space than is necessary. Using a similar method as before, we can map entries from
a pool back into y_n^{(d,c)}. First, we can apply the same index trick we have been using
on the 4x4 pool() argument matrix representing the pools. More formally, the setup
yields equation 5.13.

\[
y' = \mathrm{permute}(y) \iff y' = T y, \qquad T \in \mathbb{R}^{P_c \times P_c'}
\qquad (5.12)
\]

\[
\mathrm{index}\!\left(\begin{bmatrix} x_{11} & x_{31} & x_{13} & x_{33}\\ x_{21} & x_{41} & x_{23} & x_{43}\\ x_{12} & x_{32} & x_{14} & x_{34}\\ x_{22} & x_{42} & x_{24} & x_{44} \end{bmatrix}\right) =
\begin{bmatrix} 1 & 5 & 9 & 13\\ 2 & 6 & 10 & 14\\ 3 & 7 & 11 & 15\\ 4 & 8 & 12 & 16 \end{bmatrix}
\qquad (5.13)
\]

Rearranging the left hand side to get back the original matrix gives us equation 5.14.

\[
\mathrm{index}\!\left(\begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14}\\ x_{21} & x_{22} & x_{23} & x_{24}\\ x_{31} & x_{32} & x_{33} & x_{34}\\ x_{41} & x_{42} & x_{43} & x_{44} \end{bmatrix}\right), \qquad
V_c^{permute\#} = \begin{bmatrix} 1 & 3 & 9 & 11\\ 2 & 4 & 10 & 12\\ 5 & 7 & 13 & 15\\ 6 & 8 & 14 & 16 \end{bmatrix}
\qquad (5.14)
\]

Because we are using 2x2 non-overlapping sliding windows, we actually end up with
V_c^{permute#} = V_c^{pool}. For the general mapping from any pooling indexes to the permute# matrix,
we must first define the intermediate index matrix given by equation 5.15. The mapping is then given
by equation 5.16.

\[
\mathrm{index\_start} = \begin{bmatrix} 1 & 5 & 9 & 13\\ 2 & 6 & 10 & 14\\ 3 & 7 & 11 & 15\\ 4 & 8 & 12 & 16 \end{bmatrix}
\qquad (5.15)
\]

\[
V_c^{permute\#}(i,j) = \mathrm{index\_start}\big(V_c^{pool}(i,j)\big), \qquad i = 1,\dots,\Omega_{p'};\ j = 1,\dots,P_c'
\qquad (5.16)
\]
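In Matlab this mapping is itself just an array-indexing operation; a tiny sketch for the 4x4 example, with our own variable names, is:

```matlab
% Equation 5.16 for the 4x4 example: map pooling indexes through index_start.
index_start = reshape(1:16, 4, 4);                 % the matrix in eq. 5.15
Vpool       = [1 3 9 11; 2 4 10 12; 5 7 13 15; 6 8 14 16];
Vpermute    = index_start(Vpool);                  % equals Vpool here, as noted
```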

The final large matrix that needs to be optimized is V_{n-1}^{(d,d',c+1)}, as it is (P_c' x P_c). Instead
of doing such a large matrix multiplication, only one row of V_{n-1}^{(d,d',c+1)} is used at a time to
compute each weight on the sensitivities. Although this uses more computational time, it
requires much less memory.
Another adjustment was made just prior to upv in order to compensate for the zero
padding of the image prior to partitioning. Due to this, the sensitivities should
not propagate on the indexes that correspond to the padding. Thus, when partitioning
is performed, the indexes that correspond to non-zero-padded elements are also saved.
Referring back to the 4x4 padded example in equation 5.3, this means indexes 6, 7, 10 and
11 are saved. Then, just before upv, we will only fill V_{n-1}^{(d,d',c+1)} with the weights that correspond
to non-zero-padded elements. That is, we start with an array of zeros and only add the
weights that correspond to the non-padded elements.
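A small sketch of this adjustment for the 4x4 padded example of equation 5.3 is shown below; the variable names and the sensitivity values are illustrative only.

```matlab
% Scatter sensitivities only onto real (non-padded) positions; padded
% positions stay zero so nothing propagates through them.
padSize   = 4;                          % padded 4x4 example from equation 5.3
nonPadIdx = [6 7 10 11];                % saved indexes of the real 2x2 pixels
sensReal  = [0.3; -0.1; 0.2; 0.05];     % illustrative sensitivity values
sensPadded            = zeros(padSize^2, 1);
sensPadded(nonPadIdx) = sensReal;       % weights only on non-padded entries
```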
With these adjustments the algorithm was able to perform faster and require less memory
without sacrificing accuracy.

5.5.2 Architecture 2
The second CNN is based on the concepts and algorithms in the Neural Network
handout and the Convolutional Network handout. However, due to limited time and heavy
computations and the huge memory cost of implementing CCN in the MATLAB, we are
motivated to modify the algorithms to run the program more efficiently. In addition, we
added some rules of thumb from other architectures into our algorithms. Those modifications
are explained as follow:
In the propagation process, we first preset how we do the convolution and pooling so that
we know the size of the feature maps and the reduced maps in each layer. In practice,
a convolutional filter with a small size and stride is preferred. Therefore, we choose a 3x3
filter with a stride equal to 1, which means the partitions overlap, for the convolutional
network. This is reasonable in terms of image processing because there is a specific relation
between adjacent pixels that forms an image. Moreover, to keep the spatial sizes constant after
doing the convolution, we padded the inputs of each layer with zeros around the border. For
pooling, we use a 2x2 max-pooling matrix. This preset enables us to reduce the computation
by creating the partition order only once and saving it for future use. The
number of depths of the feature maps in each convolutional layer is set to increase
with the propagation, while the number of nodes in each layer of the neural network is set
to decrease with the propagation. Last, to get more insight into the predictions made by
the network, we use the softmax implementation in the last layer of the neural network to

see the probability of each class. Since the exponential computations can make the terms
e^{z_n(q)} and the sum of e^{z_n(k)} over k in the softmax very large, dividing the two terms might cause
numerical issues. Hence, we multiply the two terms by a constant C and get the
following expression:

\[
y_n(q) \triangleq e^{z_n(q)} \left( \sum_{k=1}^{Q} e^{z_n(k)} \right)^{-1}
= C e^{z_n(q)} \left( C \sum_{k=1}^{Q} e^{z_n(k)} \right)^{-1}
= e^{z_n(q)+\log C} \left( \sum_{k=1}^{Q} e^{z_n(k)+\log C} \right)^{-1}
\qquad (5.17)
\]

This improves the numerical stability of the computation without affecting the resulting
value. As the reference recommends, we set log C = -max_k z_n(k). This setup makes the
highest value of the vector z_n equal to zero.
For the training part, the second CNN uses the same algorithm to reduce the computation
and memory cost.
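A minimal sketch of this stabilized softmax is given below; the function name is our own.

```matlab
function y = stable_softmax(z)
% Softmax with the shift of equation 5.17, logC = -max(z), so the largest
% pre-activation becomes zero before exponentiating.
    expZ = exp(z - max(z));       % subtracting max(z) is adding logC
    y    = expZ / sum(expZ);      % probabilities that sum to one
end
```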

5.6 Network Architecture


Each of the two configurations relies on the basic ideas introduced at the beginning of the
report in the Theory Overview. In Figure 5.4 we see it laid out again, with the first half of the
net dedicated to different types of masks extracting feature representations of the data. Once
the features are extracted, they are run through the neural network section for classification.

Figure 5.4: Convolutional neural networks as in notes

Using convolutional neural networks, our classifier is broken up into phases. The first
phase is to judge whether or not there is a fish in the image by way of its transformed
feature vectors. We are not constraining our images to allow only one fish per picture;
our algorithm handles pictures with multiple fishes. The second phase is to classify the fish
in any image that passes the first test and has been judged +1. We will classify it as one of
the six categories. The second phase will receive an input image that is cropped such that a
fish makes up a majority of the picture, making it easier to classify with less noise. In our
training data pool, no image contains more than one species of fish at the same time.
Figure 5.5 depicts how the algorithm should work together.

Figure 5.5: Overview of the setup of the full algorithm

In the training procedure of the first phase, image features carry a label γ equal to
(1 0)^T or (0 1)^T, which means there is either at least one fish inside the picture or no fish at all,
respectively. Later on in the section we talk about how we tried to increase the learning rate
of the softmax function.
In the judgment procedure of the first phase, after training, we feed all the raw images
into the CNN. Ideally, at this moment, we will have, for example, (0.823 0.177)^T or (0.381 0.619)^T as the
output vector y. The (0.823 0.177)^T will be classified to the group γ = (1 0)^T, as there is at least one
fish, and the (0.381 0.619)^T will be classified to the group γ = (0 1)^T, as there is no fish. Figure 5.6
depicts graphically how the first convolutional neural network operates.
After the new data has been judged for whether there are fish or not, we segment the images
that have been judged γ = (1 0)^T such that we crop out everything except for the fish.
The way it works is that the algorithm cuts a part of the picture containing a fish and sends it back into the
first network. The network processes the cropped picture again, and based on whether the new cropped image
has a fish or not, it cuts a different part. It continues this process until the region of
the picture corresponding to the highest likelihood of being a fish is cropped out. If there is
more than one fish, it stores the first cropped piece of the picture as one fish

Figure 5.6: First convolutional neural network used to classify whether there is a fish or not
and if so, where it might be

and then sends the cropped image, without the region of the highest likelihood of being a fish,
back through the network. The algorithm continues this process until the likelihood of a fish remaining in the
image is below the 0.5 threshold. The segmentation process is illustrated in Figure 5.7, and
an ideal segmentation result is shown in Figure 5.8.
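A high-level sketch of this segmentation loop is given below. The helper functions cnn1_predict, crop_region and blank_region are hypothetical placeholders for the fish detector, the cropping step, and the step that removes an already-found fish from the image; they stand in for code that is not reproduced here.

```matlab
function crops = segment_fish(img)
% Repeatedly detect, crop and remove fish regions until the detector's
% likelihood of a remaining fish drops below the 0.5 threshold.
    crops     = {};
    remaining = img;
    while true
        [pFish, region] = cnn1_predict(remaining);   % likelihood + best region
        if pFish < 0.5                               % stop below the threshold
            break;
        end
        crops{end+1} = crop_region(remaining, region);    %#ok<AGROW>
        remaining    = blank_region(remaining, region);   % remove found fish
    end
end
```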
After the fish (or fishes) are preprocessed by the segmentation portion,
they are fed into the second CNN to generate output vector(s) of the form y = (a_1 a_2 ... a_6)^T.
Then we classify the fish in this particular cropped image. The second convolutional
neural network classifies the fish as belonging to the species associated with the maximum
element in the output vector y. This is an advantage of softmax: if none of a_1 through a_6 is
prominent compared to the other elements, we say that the fish is not one of the six
species but instead belongs to the class other.
In the training procedure of phase two, we feed the fish-part-only images, which were
cropped in the segmentation stage, with new labels (1 0 0 0 0 0)^T, (0 1 0 0 0 0)^T, ..., (0 0 0 0 0 1)^T
representing the six kinds of fish, to train the second CNN as shown in Figure 5.9.
After training CNN1 and CNN2, we can run the classification task on the raw testing
images.

Figure 5.7: The detection and cropping steps are designed to be used in tandem to find exactly
where the fish is and crop the image keeping only the fish, thus reducing the noise

Figure 5.8: (Left) Raw image under testing. (Middle) Image cropped by the segmentation
process. (Right) Remainder after one segmentation pass

Figure 5.9: Second convolutional neural network, designed to receive well-cropped images
containing only relevant information. This network determines the likelihood that the fish is of a certain
type
6. Results & Thoughts

6.1 Results

Figure 6.1: All 48 sets of weights for the 1st layer of the 1st CNN used for fish detection.
Equivalently, these will be referred to as the masks, filters, or kernels. Each mask is 3x3
and is applied as a sliding window across the feature matrix, H. The lighter portions indicate
a more positive correlation whereas the darker portions show a negative correlation.

Although we haven't reached our entire goal, we are on the cusp. We finally got our
algorithm to train to a decent set of weights that resulted in relatively good classification. The

convolutional neural network yielded 70% correct classification when determining whether
or not there was a fish in an image.
The most interesting portion of the convolutional network is its ability to learn, from the
training and without input from the user, what masks it needs to best complete the classification.
Figure 6.1 shows all 48 masks that the convolutional network learned over the course
of training. The lighter portions indicate more positive correlations whereas the darker
portions indicate negative correlations. The masks were also checked against their initial
conditions and showed significant change. While the masks do not give much intuition into
how the network solved the problem, the pre-activation signals in z_{c,n} give some insight into
how the network may have been trying to classify fish.

Figure 6.2: (Left) The original 280x280 gray-scale image that is fed into the 1st CNN.
(Middle) The resulting image after the mask at depth 8 is applied in a sliding window
fashion onto the original image. As can be seen, this mask applies a level of edge detection
on very light pixel values at gray borders. (Right) The resulting image after the mask at depth
23 is applied. This mask applies edge detection on very dark pixel values at gray borders. The
masks evolved out of the training and return edge detection features that contain more useful
information.

One such set of pre-activation signals that was converted back into images is shown
in Figure 6.2. As can be seen, the convolutional network learned varying degrees of edge
detection to help in classifying the images. One of the most prominent features of
a fish in grayscale is its shape. Humans can easily recognize a fish's body by applying
our own edge detection based on the contrast between the fish and the background. The
convolutional network was trying to copy this behavior to some degree, as it is one of the
most common yet powerful techniques in image processing.
Another pre-activation signal converted back to an image can be seen in Figure 6.3. In this
case, the convolutional network learned to white out large portions of the image, saturating
most signals. In doing so, the network is able to clear out a large amount of detail that it
has deemed noise. The interesting point is that the prominent black back of the
tuna remains intact through the whiteout filter. When asked to classify an image as having a
fish, very few people with limited backgrounds in machine learning would suggest looking
for the black back as a major option. This simply goes to show how powerful convolutional
networks can be in seeing underlying trends that humans can often overlook.
Overall, the cost function was minimized as it should have been over most of the training
run, as shown in Figure 6.4. There were instances where certain parameters would yield large shifts

Figure 6.3: (Left) Again, the original 280x280 grayscale image that is fed into the 1st CNN.
(Middle) The resulting image after the mask at depth 15 is applied in a sliding window
fashion onto the original image. The image is much lighter and only the darkest portions
of the image are left over. This causes most of the details in the image to be thrown away,
except the dark stripe on the back of the tuna fish. Thus, the CNN might be able to use this
as a definitive feature when determining whether a fish exists in the image or not

in the cost function. For example, when µ was relatively large while ρ was small, the
algorithm seemed to react very quickly to misclassification (when the gradient had a relatively
high magnitude). Under these conditions, the network classification was highly dependent
on the previous label.
On the other hand, when ρ was very large in comparison to µ, normalization took
over the weighting of the cost function. In this regard, the gradient had little effect as the
network was mostly aimed at keeping the weights low. Thus, the values of µ and ρ
needed to be within a certain range that would yield good classification results.
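As an illustration only (not the exact update from the handout), a regularized stochastic-gradient step of the following form shows the tension between the two parameters: µ scales the reaction to the current sample's gradient, while ρ shrinks the weights regardless of the data. The values and sizes below are placeholders.

```matlab
% Illustrative regularized stochastic-gradient update; placeholders only.
mu = 1e-3;  rho = 1e-4;
W    = randn(8, 8);                    % placeholder weight matrix
grad = randn(8, 8);                    % gradient for the current sample
W = (1 - 2*mu*rho)*W - mu*grad;        % large mu: chase the gradient,
                                       % large rho: mostly shrink W
```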

6.2 Thoughts
Given a dataset with image sizes ranging from around 600x600 to 1200x900, issues of
computation speed, memory cost, and varying input sizes arise. To deal with
those problems, our strategy is to implement one CNN first to determine whether there is a
fish in the image. For those images with fish, we use a segmentation algorithm to crop them
such that the fish falls within the desired image size. This preprocessing not only reduces
the data size but also decreases the effect of noise. Then, we use the second CNN to
classify the types of fishes. The two architectures can be seen as different ways
to validate arguments. However, since their makeup is a little different, we can also see
them as separate classifiers that are used in different parts of the algorithm to accomplish
slightly different tasks. The slight differences in parameters and structure of the nets give
them the flexibility to cope with this complex problem.
So far, it seems that we are on the right track to set up the entire algorithm quite
soon. As the results section shows, we got the first convolutional neural network working
properly with relative success given the limited training. The CNN shows how well it

Figure 6.4: Cross-entropy cost function over a training run of 100 samples. The cost
function is generally minimized over the run. The bumps occurred when there was a
misclassification and a relatively large reaction by the gradient.

extracts the useful information for the classification. We have also set up a convolutional
neural network that works well on the multi-class classification problem. Given more time we
can definitely finish implementing the CNN and get a good result.
This is by no means an easy problem to solve. Fortunately, through this experience we have gained
some intuition about the parameters and how they affect the overall performance and descent
of the cost function. As Dr. Ali Sayed would put it: there's more art in engineering than
you think.

6.3 Future work


Here we talk a little bit about how we plan to continue this project. As mentioned
before, we had little time to do everything we wanted to do. And there is still the competition
we entered.

6.3.1 Spatial Pyramid Pooling & Bagging and Boosting


Throughout this paper we have constantly harped on the image size and dimensionality.
One way to address this is to use spatial pyramid pooling [11][12][13]. This allows
us to gather feature vectors based on spatial location by using a support vector machine and histogram
bins. We start with a large bin which holds all feature vectors and then split it
into finer and finer bins, with each layer consisting of finer bins. Features that appear in multiple

layers are weighted more heavily in this scheme. Because the bins can
be cut to any number, images of different sizes can be fed through the network without
any major issues. In the coming weeks we plan on looking into this technique further
and possibly implementing it.
Another method that we are really excited about trying out in the future is bagging
and boosting. With such a complex problem, as one can imagine, we have many weak
classifiers. We are interested in seeing how far the theory can be put into practice. It's really
a shame that we had not heard of this sooner.

6.3.2 Competition
We plan on continuing this project. Kaggle's competition deadline isn't for another
month, so we have all of spring break to train and modify the algorithm to get it ready for the
competition. If we want any chance of competing with the top, we will probably have to
read a few more packets from Dr. Ali Sayed.
7. Feedback & Experience

7.1 Each Member’s Contribution


Colin Togashi is the main coder of the group, with the most experience. All group
members coded to double-check algorithms and results, but he by far led the effort for
the first type of architecture for convolutional neural networks. He was able to take bits
and pieces of code from other people's code and integrate them smoothly into the much larger
code base. Colin was responsible for setting up the first small toy test with Gaussian points and
the Vertical Horizontal Line Test. He spent many hours testing out parameters and making
sure that the whole first convolutional neural network algorithm worked. He also set up the
memory management on the computer so that it is able to process large images relatively
quickly on limited computational power.
Meng-Hao Li did a large portion of outside research about the parameters and the finer
details involved in making a successful second convolutional neural network
architecture work. He is also largely responsible for the 6-icon recognition test. We were then
able to compare two different approaches to the same network and compare results.
Jack Shue worked on much of the analysis that went into the toy examples to make
sure things worked properly as they should, and to decide on what parameters we should use
based on results from the tests. He went into detail specifically on subjects such as softmax,
the cross-entropy gradient, and computational gradient checking. He developed many graphs
used to illustrate ideas. He was also responsible for coding different checks for gradient
descent and learning rates. He worked with Meng-Hao Li on coding certain sections.
Gabriel Fernandez worked on managing the group's direction and overall approach. As you
may have noticed, we split up into two groups and developed two separate convolutional
neural networks. This is akin to bagging and boosting our efforts. It also made for good
validation of results and approaches. Gabriel oversaw this. He worked with Colin on coding
the first architecture of the convolutional neural network. He aimed at condensing some

of the algorithms into simpler forms because, in the code, the algorithms in the
notes would many times result in matrices that were out of bounds for Matlab to compute. He also
looked into spatial pyramid pooling and other resources outside of the notes. He organized
each meeting and kept the direction and focus of the group intact. He was responsible for
managing the report and managing the responsibilities of everyone.
Every week we had at least three group meetings since forming our groups on January 23,
2017. Every week we were responsible for reading through, deriving, and discussing the algorithms
in the reading. Once we finished with that, we all started coding up smaller examples of the
algorithm. From there we started to have people specialize in certain areas and look into
things outside of the reading. The above is only a short summary of what we all worked on.
It is very difficult to say that someone specifically worked on one small topic because we all
worked on all topics and had discussions among each other to better understand the
material. There are portions where individuals had special skills or interests, but it wouldn't
be fair to only consider the small list above and its limited capacity. We as a group
were at the majority of the office hours of Dr. Ali Sayed and Stefan Vlaski. Each and every one
in this group has made a tremendous effort and has learned much more than anticipated
through this project. We all had no prior experience with networks in general. Considering that
we originally signed up for neural networks, then went into convolutional networks, and even
dabbled with support vector machines, we as a group think that each individual
should be commended for the effort shown here in the paper and the additional work not
conveyed here.

7.2 Handout Feedback


We signed up for the neural network topic but also decided to pursue convolutional neural
networks since our topic material required more complex features. The material from
these two packets was extremely helpful in setting up the networks. With all that we have
experienced while tackling this classification problem, we have learned an enormous amount
and can provide a point of view for people who want to learn machine learning algorithms
without prior knowledge.

7.2.1 Neural Network


In the projection handout of Neural Network, page 2496, section 47.23.7 Cross-
Entropy Training, there is a conflict between the statements in the text and the equation. Beneath
equation (47.1051), it says that δ_{l,n} for all layers becomes independent of the derivatives
of the activation function. But in the algorithm summary at equation (47.1059),
δ_{l,m} = f'(z_{l,m}) ⊙ (W_{l,n-1}^T δ_{l+1,m}), the derivative term f'(z_{l,m}) clearly exists.
Jack reproduced the result in equation (47.1059) by hand calculation, so there may be some
unclear statements that need to be clarified.
When Jack tried to combine the softmax algorithm and cross-entropy, he failed to derive
the succinct equation in (47.1058). His calculation ended up with a cross term that cannot
be canceled:
\[
\delta_L(j) = y_L(j) - \gamma(j) + \sum_{q \neq j}^{Q} \frac{y(q) - \gamma(q)}{\left( \sum_{q=1}^{Q} e^{z_L(q)} \right) - e^{z_L(q)}} \left( -e^{z_L(j)} \right)
\qquad (7.1)
\]

It comes from equation (47.1041) that, for the softmax algorithm, the activation function in
the last layer is contributed to by all of the elements in the pre-activation vector z.
A few of us felt that towards the end of the Neural Network packet the material and
explanations seemed to taper off, as if bits and pieces were hastily put together. There
seemed to be less explanation and intuition given. Of course, as mentioned, this encouraged
us to go out and find the answers, yet we think the packet abruptly changed pace.
For the most part, though, the packet did what Dr. Ali Sayed expressed early on. It
supplied us with enough foundational knowledge to apply it to a real world problem. It also
introduced us to the smaller intricacies of the inner workings of much larger machine learning
approaches.

7.2.2 Convolutional Neural Network


For the convolutional neural network handout, some of us thought the order could have
been reversed. Without any background in convolutional neural networks, it seemed a bit
confusing to lead with masks and then go into general convolutional neural networks.
We think that setting up the larger picture first, then breaking it down into small pieces
and going into finer detail, would help the learning process for people without any prior
experience.
This may have been on purpose, but in terms of parameters and gaining intuition, some
of us would agree that there could have been a more in-depth analysis of that section with
a more nuanced approach. We understand it's difficult, but perhaps running through some
analysis done on prior data might help.
We know that our application can be specific, but it would be nice to have more
knowledge ahead of time about certain pitfalls for widely used applications.
This brings us to the final point. Given that we were instructed to use
Matlab, we would assume that the general algorithm in the notes would work. However, due
to the complexity and nature of the problem, the algorithm surpassed Matlab's limits. It was
mentioned earlier, but at times it felt like the math and theory were provided without much
consideration of the applications, and sometimes practice can be very different.
We think, though, that for the amount of information and theory covered, it did a good job
of making us understand the ins and outs of networks. We assume that the sparseness in
material that we criticized may have been meant to encourage research and discovery outside of
the class material. We can relate this to growing pains for first-time learners. We need to do
more weight training!

7.2.3 General Feedback


Throughout this entire report, general feedback has been given. Overall, though, this was
unanimously our hardest and most interesting course. The amount of material covered at times
seemed overwhelming. Some of us are planning to use next quarter to absorb most of it.

The issues of limited time in a quarter and limited time to train our neural networks are
similar in this way.
A really minor inconvenience was the number of typos. At times they were easy to
find, but at other times they were crippling in terms of homework or understanding a certain
derivation. We can all agree that writing a new book will produce many typos, but possibly
there may have been a better solution.
One other thing we wanted to raise was being able to see what other groups did. We
agree that full-length presentations take up too much time. However, we would at least like to
know what kinds of applications other people are trying. Many of us would agree that if every group
just had a two-minute time limit to give an informal introduction to what they were working on
and their preliminary results, we'd learn something. In addition, it is very interesting to see
what everyone is working on.
One of the things we can appreciate is Dr. Ali Sayed's teaching style. It's all about
learning. In that sense we think we've learned a ton, much more so than in any other class.
Dr. Ali Sayed is also very good at teaching. When he teaches, you can tell that when he pauses
he is sometimes thinking about how to relay the information in ways we can understand. The
intuition and the way he stresses important points twice really help with learning. We are
all first-year mechanical graduate students, and we were all excited when Dr. Ali Sayed
talked about the optimal gain being the variance of xy over the variance of y. Mind blown!
Why haven't other teachers stated this simple but deep line? Small tidbits like that added up
to really give us not only intuition in his class but also intuition into our own field.

7.2.4 Typos
47.1051 Neural Network Packet
• Issue: Missing the 1/N averaging term
• Without it, J_emp keeps growing and diverges. It was noted before, but when
implementing the algorithm it can be confusing.
• Comment: it would be helpful to note that N is only the number of samples
• Corrected version [14][15]:

\[
J_{emp}(W,\theta) \triangleq \sum_{l=1}^{L-1} \rho \|W_l\|_F^2 + \frac{1}{N} \sum_{n=0}^{N-1} \sum_{q=1}^{Q} \ln\!\left( y_n(q)^{\gamma_n(q)} \left(1 - y_n(q)\right)^{1-\gamma_n(q)} \right)
\qquad (7.2)
\]

47.1214 Convolutional Neural Network Packet
• Issue: wrong index

\[
t_n^{(1,C)}, \; t_n^{(2,C)}, \; \dots, \; t_n^{(D_C,C)},
\qquad (7.3)
\]

47.1022 Neural Network Packet
• Issue: typo with a zero instead of θ

\[
d_l = \theta_{l+1,n}
\qquad (7.4)
\]

47.1018 Neural Network Packet
• Issue: should be a row vector instead of a column vector

\[
\begin{pmatrix} -\delta_3(1) & -\delta_3(2) \end{pmatrix}
\qquad (7.5)
\]

47.1223a Convolutional Network Packet
• Issue: dimensions are swapped

\[
(S_c D_{c-1} \times 1)
\qquad (7.6)
\]

47.1025 Neural Network Packet
• Issue: should be 1/n_l instead of the square root

\[
\frac{1}{n_l}
\qquad (7.7)
\]
47.1276 Convolutional Network Packet, pg 2559
• Issue: Extra µ term on the Neural Network gradient update. There should only
be one µ

\[
W_{l,n} = (1 - 2\mu\rho_2)\, W_{l,n-1} - \mu\, \delta_{l+1,n} (y_{l,n})^T
\qquad (7.8)
\]

• Issue: Missing the calculation of Δ_{n,C} before the backward pass.

\[
\Delta_{n,C} = \mathrm{diag}\{\delta_n^{(1,C)}, \delta_n^{(2,C)}, \dots, \delta_n^{(D_C,C)}\}
\qquad (7.9)
\]

• Issue: wrong sign.

\[
\theta_{l,n} = \theta_{l,n-1} + \mu\, \delta_{l+1,n}
\qquad (7.10)
\]

47.1216 & 47.1276 Convolutional Neural Network Handout
• Issue: the P_{c+1} subscript is incorrect if you start off with P_0
• Comment: This problem persists depending on how you index your first P. Since the
c subscript is zero at the beginning of the propagation algorithm, the h_{n,P_{c+1}}^{(d,c)}
should be h_{n,P_c}^{(d,c)}. The same can be said for the P_{c+1} term in 47.1276. Depending on
how the initial value is defined, it may change in more places.

7.3 How Much Did We Learn?

Colin Togashi:
"My training has been all dependent on the parameters and the initial
conditions. No, I mean it’s fun, it’s hard, it’s definitely pushing me outside my
comfort zone, but I learned a lot."

Meng-Hao Li:

"I really learn a lot from the lectures, homework, and especially the project.
I think this course is very good start for any beginner like me who have not
touched learning algorithm before. The lectures not only state the motivation
of each algorithm clearly but also provide good intuitions to understand the
algorithms. Thus, after this class, I can learn other learning algorithm fast
because I have already known the basic concept of learning."

Jack Shue:

"Pretty a lot. I didn’t expect things would go in this direction. I thought it


would be more geared towards stochastic estimation process. It’s a good chance
to get to know what people are talking about and how these things distinguish
cars, cats, and dogs, and stuff. Sometimes can tell a story about it. It’s somehow
powerful but still mysterious in many aspects."

Gabriel Fernandez:

"I never thought I would learn so much and be trying increase my step size
even more. I’m in pain."
8. Bibliography

8.1 References
[1] The Nature Conservancy Fisheries Monitoring | Kaggle. URL: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring (cited on page 6).
[2] The Nature Conservancy. URL: http://www.conserveca.org/?c=2 (cited on page 6).
[3] The Nature Conservancy Fisheries Monitoring | Kaggle. URL: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data (cited on page 8).
[4] ImageNet Tree View. URL: http://www.image-net.org/synset?wnid=n02512053 (cited on pages 8, 32).
[5] Jia-Bin Huang. Lecture 29: Convolutional Neural Networks - Computer Vision Spring 2015. May 2015. URL: https://www.slideshare.net/jbhuang/lecture-29-convolutional-neural-networks-computer-vision-spring2015 (cited on page 11).
[6] OpenCV 3 Tutorial. URL: http://www.bogotobogo.com/python/OpenCV_Python/python_opencv3_Image_Canny_Edge_Detection.php (cited on page 14).
[7] Michael A. Nielsen. Neural Networks and Deep Learning. URL: http://neuralnetworksanddeeplearning.com/chap3.html (cited on page 25).
[8] CS231n Convolutional Neural Networks for Visual Recognition. URL: http://cs231n.github.io/convolutional-networks/ (cited on pages 33, 37).
[9] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. URL: http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf (cited on page 35).
[10] Ali Sayed. Adaptation and Learning. Mar. 2017 (cited on page 36).
[11] Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. Apr. 2015. URL: https://arxiv.org/abs/1406.4729 (cited on page 50).
[12] Junfeng He, Shih-Fu Chang, and Lexing Xie. "Fast kernel learning for spatial pyramid matching". In: (2008), pages 1–7 (cited on page 50).
[13] Kristen Grauman and Trevor Darrell. "The pyramid match kernel: Efficient learning with sets of features". In: Journal of Machine Learning Research 8.Apr (2007), pages 725–760 (cited on page 50).
[14] Cross Entropy, Wikipedia. URL: https://en.wikipedia.org/wiki/Cross_entropy (cited on page 55).
[15] Improving the way neural networks learn. URL: http://neuralnetworksanddeeplearning.com/chap3.html (cited on page 55).
