Paarth Tandon
ABSTRACT
During this project, four neural networks were created to diagnose breast cancer tissue as either
malignant or benign. Each network used a different combination of regularization methods: the first
used no regularization, the second used dropout, the third used batch normalization, and the fourth
used both dropout and batch normalization. The networks were then trained on a database of 569
samples provided by the University of Wisconsin that included thirty-two different features about
the cell nuclei of each sample of breast tissue. The four networks were compared on how accurately
they diagnosed the samples. The network that used only batch normalization was the most
accurate, with an accuracy of 93.08%. The network with the lowest accuracy was the one that
used only dropout, with 56.47% accuracy.
INTRODUCTION
In this experiment, four neural networks were run on the breast cancer dataset
provided by the University of Wisconsin (University of Wisconsin, 1995). To understand the
differences between the models, some terms must be explained.
Neural networks are computer systems inspired by biological neural networks. Just like their
biological counterparts, they are composed of neurons that connect into a
network. They are useful for this type of work because, like the brain, they can learn an
association between two factors. In real brains this is called associative learning (Encyclopedia
Britannica, 2016).
Neurons
[Figure: a simulated neuron compared with a real neuron. Labels: outputs, activation function, weights, input receivers.]
This is a representation of a simulated neuron compared to a real neuron. The simulated neurons
are most like the multipolar neurons that exist in our central nervous system, which consists of
the brain and the spinal cord (Martini-Hutchings et al., 2018). In the simulated neuron, the weights on the input
receivers can be thought of as the amount of stimulus that the neuron is receiving. These are often
initialized to a random value between zero and one but change as the network learns from its mistakes.
This can be compared to the dendrites of a real neuron, which receive signals from other neurons
in the body (Martini-Hutchings et al., 2018). The way new values for weights are determined is
explained later in this paper.
The activation function in the simulated neuron takes the sum of all the weighted inputs and
governs at what thresholds the neuron will output certain values. This can be compared to how
an action potential works in a real neuron. As with an activation function, a threshold must be met to
send a signal through the neuron. In a real neuron this threshold is reached if the graded potential
at the axon hillock is sufficiently large. Only then will the signal pass through the axon and the
synapses to be processed by the recipient of the signal (Martini-Hutchings et al., 2018). In all of
the networks, the activation functions used are ReLU and softmax. The reasons for the use of these
specific functions are explained in the next section. The output of the activation function in the
simulated neuron can be compared to the signal leaving the axon terminal through the synapse
(Martini-Hutchings et al., 2018).
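The simulated neuron described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the project: the function names, the example inputs, and the choice to leave out a bias term are all assumptions made for the example.

```python
import random


def relu(x):
    """Activation function: pass positive sums through, output 0 otherwise."""
    return max(0.0, x)


def neuron_output(inputs, weights, activation=relu):
    """Sum the weighted inputs, then apply the activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical inputs; weights start as random values between zero and one.
random.seed(0)
inputs = [0.5, -1.2, 3.0]
weights = [random.random() for _ in inputs]
print(neuron_output(inputs, weights))
```

The weights are the only part of this neuron that would change during training; the activation function stays fixed.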
Neural Networks
[Figure 3: a basic neural network with an input layer, a hidden layer, and an output layer.]
Inputs
This is a basic representation of a machine learning model called the neural network. It can be
compared to the nervous system in our bodies. The lowest layer (triangles) represents the inputs.
These are not neurons as all they do is feed the network with data. They can be compared to the
sensory input of our bodies. In this simple model there are only two inputs but in the dataset used
there are thirty-two inputs.
The ReLU function is f(x) = max(0, x). ReLU is used because it is known to train networks faster and more
efficiently than other activation functions (Sharma, 2017). The function outputs 0 for every
input that is less than or equal to zero. This leads to
about fifty percent of all activations being 0, meaning the neuron does not fire. This
may seem like a huge downside, but it makes the network “lighter”, allowing it to run faster
and with better efficiency (Sharma, 2017). In certain networks the ReLU function can cause
problems by being too “light”, but after further testing and validation it was shown that
the networks used in this project were not affected by such problems (Sharma, 2017).
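As a small illustration of the behavior described above, the following Python sketch applies ReLU to a few hand-picked values. The sample values are made up for the example and are not taken from the project.

```python
def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return x if x > 0 else 0.0


# Hypothetical weighted sums, some negative and some positive.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in samples]
print(outputs)
```

Every input at or below zero maps to 0, which is why a neuron with a negative weighted sum does not fire.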
Softmax is used as the activation function in the final layer because it converts the layer’s
outputs into probabilities, one for each
of the possible classifications (Lan, 2017). The integration done by these functions can be thought
of as how the weights between the neurons are constantly changing as the network trains, just like
how the ideas in our brain change as we are presented with more sensory information. This causes
the neurons that are making stronger associations to have higher weights.
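A minimal Python sketch of softmax for a two-class output (for example, malignant and benign) might look like the following. The raw scores are hypothetical, and subtracting the largest score before exponentiating is a common numerical-stability choice rather than something stated in this paper.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the final layer for one sample.
probs = softmax([2.0, 0.5])
print(probs)
```

The class with the larger raw score receives the larger probability, and the network’s diagnosis is the class with the highest probability.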
Output
The output layer is just the output of the network. In the networks used for this project, the output
is the diagnosis of the sample as malignant or benign. This can be compared to the motor output from our
central nervous system.
Supervised Learning
The type of machine learning that the networks in this project use is called supervised learning.
Supervised learning involves using a dataset of training examples with correct labels (Maini,
2017). There are two types of supervised learning: regression and classification. Regression
predicts a continuous numerical value (Maini, 2017). An example of this would be predicting stock prices,
as they are constantly changing and consist of one numerical value. Classification (the type of
supervised learning that the networks in this project use) is used to assign a label to something
(Maini, 2017). An example of this would be looking at a picture of a tree and figuring out which
species it is.
Training
Training is how the network learns. During each step of training a small
amount of data is exposed to the network. This data is run through all the neurons in the network
to produce some output. This output is then compared to the expected output to calculate the error. The error is reduced
using an algorithm called gradient descent, a common algorithm used when working with neural
networks (Mahanta, 2017). The goal of gradient descent is to minimize error or bring it below a
certain threshold by updating the network’s weights or coefficients based on mistakes in each
iteration (Mahanta, 2017). The process of using this error to update the weights between the neurons in the
network is called backpropagation (McGonagle, n.d.). The dataset used consisted
of 569 samples. Ten percent of it (57 samples) was set aside to test the accuracy of each network,
and the rest was used for training. These samples must be set aside because if the networks were
tested on samples that they had already trained on, their accuracy would be misleadingly high,
as they would have already seen those samples before. Specifically, ten percent was set aside because it is
a large enough sample to validate the networks, but it also does not take too much away from the
already small dataset.
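The weight-update idea behind gradient descent can be sketched with a single weight. This is a simplified illustration under stated assumptions, not the project’s training code: it uses a one-weight model with squared error, and the learning rate, step count, and toy data are all made up for the example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Fit y = w * x by repeatedly stepping the weight against the gradient."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Move the weight a small step in the direction that reduces error.
        w -= learning_rate * grad
    return w


# Toy data generated from y = 2 * x, so the ideal weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = gradient_descent(xs, ys)
print(w)
```

In a full network the same kind of update is applied to every weight, with backpropagation supplying each weight’s gradient.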
Loss
Loss is a measurement of how well the network is learning. It is not a percentage like
accuracy, but rather a representation of how much error is being made in each epoch, or iteration
(Nielsen, 2017). To calculate this a cost function is used. The networks that were used in this
project use cross entropy as their cost function, as it is well suited to
classification problems such as tumor diagnosis (Nielsen, 2017). In short, it is best for a network
to reduce its loss as fast as possible.
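As a rough illustration of cross entropy for a single sample, the sketch below computes the loss from the probability that the network assigned to the correct class. The probabilities are hypothetical, and the small floor value that guards against log(0) is an assumption of this example.

```python
import math


def cross_entropy(predicted_probs, true_index):
    """Cross-entropy loss for one sample: -log(probability of the true class)."""
    eps = 1e-12  # floor to avoid taking log(0)
    return -math.log(max(predicted_probs[true_index], eps))


# Hypothetical predictions where the true class is index 0.
confident_right = cross_entropy([0.9, 0.1], 0)
confident_wrong = cross_entropy([0.1, 0.9], 0)
print(confident_right, confident_wrong)
```

A confident correct prediction gives a small loss, while a confident wrong prediction gives a much larger one.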
Regularization
Adding regularization to a neural network can increase its speed and efficiency. There are many
forms of regularization when working with neural networks. The two used in some of the networks
in this project were dropout and batch normalization. Dropout randomly removes some
of the neurons in the hidden layers during training (Budhiraja, 2016). This is done to prevent overfitting.
Overfitting is when the network can accurately predict things about one specific dataset, but not a
broader dataset on the same subject (EliteDataScience, 2017). An example of this would be a
network that is meant to classify flowers. The network is trained on a small set of eight different
species of flowers and can accurately classify them. The same network is then tested on a huge database
of flowers of the same species and fails badly. This is because the network was proficient at
classifying the specific flowers in the original database and did not become proficient at classifying
flowers overall. Ignoring a set of neurons can prevent this because it forces the network to learn
more robust features.
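A minimal sketch of dropout applied to one hidden layer’s activations is shown below. The dropout rate of 0.5, the fixed random seed, and the rescaling of the surviving activations (a common variant often called inverted dropout) are assumptions of this example; the paper does not state which rate or variant the project used.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is only active during training
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)   # fixed seed so the example is repeatable
hidden = [1.0] * 10       # hypothetical hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

Because a different random set of neurons is silenced on each training step, no single neuron can be relied on to memorize a specific training sample.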
The second form of regularization is batch normalization. This normalizes the data in each batch
around its mean, so that values are expressed in standard deviations from that mean. This can slow down the network but, in some cases, can increase the
accuracy of the network at the same time (Ioffe, 2015). For example, if a network
is comparing two features of two separate species of flowers, it would
perform better if each value was normalized, as this would make the relationship a comparison of
variation in standard deviations rather than a comparison of two sets of numbers on different orders
of magnitude.
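The normalization step can be sketched as follows. This is a simplified illustration: the feature names and values are hypothetical, and the learnable scale and shift parameters that full batch normalization also includes are left out.

```python
import math


def batch_normalize(values, epsilon=1e-5):
    """Rescale one batch of values to zero mean and (roughly) unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    # epsilon keeps the division safe when the variance is close to zero.
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


# Two hypothetical features on very different scales.
feature_small = [10.0, 12.0, 14.0, 16.0]
feature_large = [300.0, 450.0, 600.0, 750.0]
norm_small = batch_normalize(feature_small)
norm_large = batch_normalize(feature_large)
print(norm_small)
print(norm_large)
```

After normalization both features are expressed in standard deviations from their batch mean, so they can be compared even though their raw magnitudes differ.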
MATERIALS
PROGRAMMING
There were four networks used, each with three hidden layers. The first network had no
regularization put onto it. The second network used dropout. The third network used batch
normalization. The fourth network used dropout and batch normalization.
Here is the code that initializes each network. The highlighted code is what differentiates between
the different types of networks.
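The project’s own listing is not reproduced in this text. As a stand-in, the following hypothetical sketch shows one way the four configurations could differ. It is not the project’s code: the function name, the sixteen units per hidden layer, the dropout rate of 0.5, and the placement of batch normalization after each activation are all assumptions. Only the three hidden layers, the ReLU and softmax activations, and the four regularization combinations come from this paper.

```python
def build_layers(use_dropout, use_batch_norm, hidden_layers=3,
                 hidden_units=16, dropout_rate=0.5, num_classes=2):
    """Return a plain description of the layers in one network variant."""
    layers = []
    for _ in range(hidden_layers):
        layers.append(("dense", hidden_units, "relu"))
        if use_batch_norm:
            layers.append(("batch_norm",))
        if use_dropout:
            layers.append(("dropout", dropout_rate))
    # Final layer: one output per class, with softmax activation.
    layers.append(("dense", num_classes, "softmax"))
    return layers


networks = {
    "No Regularization": build_layers(False, False),
    "Dropout": build_layers(True, False),
    "Batch Normalization": build_layers(False, True),
    "Dropout and Batch Normalization": build_layers(True, True),
}

for name, layers in networks.items():
    print(name, layers)
```

The two boolean flags are the only thing that differs between the four variants, which mirrors the comparison the experiment is built around.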
[Figure: accuracy (0% to 100%) by epoch (1 to 10) for each network. Series: No Regularization, Dropout, Batch Normalization, Dropout and Batch Normalization.]
Loss for Each Network
[Figure: four panels, each plotting loss against epoch (1 to 10).]
CONCLUSION
This project achieved its goal of comparing the four different networks to see which was the most
accurate. The network with the highest accuracy used only batch normalization as its form of
regularization. Compared to the network without any added regularization, however, this was not a statistically
significant improvement according to an unpaired t-test. The network with the lowest
accuracy used only dropout as a regularization method. Adding batch normalization to the
network using dropout did increase the accuracy, but not by a statistically significant amount
according to an unpaired t-test. Dropout hurt the accuracy of the network so
heavily because losing some of the connections while learning drastically reduced the rate at
which learning could occur. Also, because of the small size of the dataset (569 samples), there was
not enough data to make up for the lost connections. A similar situation is seen when looking at
efficiency, as the networks that used dropout started with a poor loss and ended with a poor
loss. In stark contrast, the networks that did not use dropout became efficient quickly. The
network that used only batch normalization did become efficient more slowly than the network with no
regularization, but that is also most likely because of the small dataset. If the dataset were much larger, the
graphs of both networks would most likely look very similar.
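For readers unfamiliar with the unpaired t-test mentioned above, the sketch below computes the test statistic for two independent sets of accuracy values. The accuracy values are placeholders invented for the example and are not results from this project. The sketch uses Welch’s form of the unpaired test, which is an assumption, since the paper does not say which form was used.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return the t statistic and degrees of freedom for an unpaired t-test."""
    na, nb = len(sample_a), len(sample_b)
    # Sample variance of each group divided by its size.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Placeholder accuracies from repeated runs of two networks (not real results).
runs_a = [0.90, 0.92, 0.94, 0.91, 0.93]
runs_b = [0.89, 0.93, 0.92, 0.90, 0.91]
t_stat, dof = welch_t(runs_a, runs_b)
print(t_stat, dof)
```

The t statistic and degrees of freedom would then be compared against the t distribution to obtain a p-value; a difference is usually called statistically significant only when that p-value falls below a chosen threshold such as 0.05.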
DISCUSSION
Three major observations were made during this project. The first observation was that the
networks in this project would benefit greatly from a larger dataset for breast cancer research.
This would allow the networks that were trained using dropout to be truly validated because,
theoretically, networks trained using dropout should do a better job when diagnosing
a larger, broader database, especially if they are being trained as they diagnose. This is because networks
that use dropout have a broader understanding of the available data, as they lose many of the
specific associations they make. Networks like the ones in this project will only get better
as they are presented with more data.
The second observation was that a good way to validate the results of this program would be to
conduct an experiment where both the doctors and the program have to diagnose the same samples
of breast cancer tissue. This experiment would allow human error and the program’s error to be
compared.
One way to get some insight into this comparison without conducting the described experiment is
to look at how accurately doctors diagnose breast cancer on their own. A study titled
“Diagnostic Concordance Among Pathologists Interpreting Breast Biopsy Specimens” was done
to investigate how accurately pathologists judge biopsy results. This study was published in the
Journal of the American Medical Association on March 17, 2015 (Elmore, Longton, & Carney,
2015). It involved 115 pathologists and 240 biopsy samples of breast tissue. Each pathologist was
presented with all 240 samples and instructed to give a diagnosis for each one. In the end, the
pathologists correctly diagnosed the samples only 75% of the time. This is lower than the 93.08% accuracy
of the best network in this project, but a doctor is trained to do many more things than the program, such as
treating the cancer and any further complications that may arise.
The third observation was that integrating this program into preexisting diagnosis software
could greatly help doctors. A program like this could flag potentially tricky cases where it is
harder for doctors to come to a conclusive diagnosis. It would also function as a way for doctors
to quickly get a second opinion about a case. By no means would this replace
the doctor’s diagnosis; rather, it would be a tool that doctors can use to improve the accuracy of
their diagnoses and the efficiency with which a case is diagnosed.
Machine learning is another step forward in the general automation of society. It allows increasingly
complex tasks to be completed with little human interaction. Whether this drive toward
automation is better for society or not is a moral and ethical question. From a technical standpoint,
complete automation of entire industries is already happening. Whether society is ready for such
a drastic change is something to be observed in the coming years.
References
Budhiraja, A. (2016, December 15). Dropout in (Deep) Machine learning. Retrieved from
https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-
to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5
EliteDataScience. (2017, September 7). Overfitting in Machine Learning: What It Is and How to
Elmore, J. G., Longton, G. M., & Carney, P. A. (2015, March 17). Diagnostic Concordance in
https://jamanetwork.com/journals/jama/fullarticle/2203798
Heaton, Ph.D., J. (2017, June 1). The Number of Hidden Layers. Retrieved from
http://www.heatonresearch.com/2017/06/01/hidden-layers.html
Ioffe, S. (2015, March 2). Batch Normalization: Accelerating Deep Network Training by
Lan, H. (2017, November 13). The Softmax Function, Neural Net Outputs as Probabilities, and
function-neural-net-outputs-as-probabilities-and-ensemble-classifiers-9bd94d75932
Mahanta, J. (2017). Keep it simple! How to understand Gradient Descent algorithm. Retrieved
from https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-
algorithm.html
Maini, V. (2017, August 19). Machine Learning for Humans, Part 2.1: Supervised Learning.
740383a2feab
Martini, F., Nath, J. L., Bartholomew, E. F., Ober, W. C., Ober, C. E., Welch, K., &
https://brilliant.org/wiki/backpropagation/
Nielsen, M. (2017, December). Neural networks and deep learning. Retrieved from
http://neuralnetworksanddeeplearning.com/chap3.html
functions-in-neural-networks-9491262884e0
University of Wisconsin. (1995, November 1). UCI Machine Learning Repository: Breast
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)