
Diagnosing Breast Cancer using Machine Learning

Paarth Tandon
ABSTRACT
During this project, four neural networks were created to diagnose breast cancer tissue as either
malignant or benign. Each network used a different combination of regularizations: the first used
no regularization, the second used dropout, the third used batch normalization, and the fourth used
both dropout and batch normalization. The networks were then trained on a database of 569
samples provided by the University of Wisconsin that included thirty-two different features describing
the cell nuclei of each sample of breast tissue. The four networks were compared on how accurately
they diagnosed the samples. The network that used only batch normalization was the most
accurate, with an accuracy of 93.08%. The network with the lowest accuracy was the one that
used only dropout, with 56.47% accuracy.
INTRODUCTION
During this experiment, four neural networks were run on the breast cancer dataset
provided by the University of Wisconsin (University of Wisconsin, 1995). To understand the
differences between the models, some terms must first be explained.

Neural networks are computer systems inspired by the biological neural networks found in living
organisms. Like their biological counterparts, they are composed of neurons connected into a
network. They are useful for this type of work because, like the brain, they can learn an
association between two factors. In real brains this is called associative learning (Encyclopedia
Britannica, 2016).

Neurons

Figure 1 (ThoughtCo.) and Figure 2: a simulated neuron, with its input receivers, weights, activation function, and outputs labeled, shown alongside a real neuron.

This is a representation of a simulated neuron compared to a real neuron. The simulated neurons
are most like the multipolar neurons that exist in our central nervous system, the brain and the
spinal cord (Martini-Hutchings et al., 2018). In the simulated neuron, the weights on the input
receivers can be thought of as the amount of stimulus the neuron is receiving. They are usually
initialized to random values between zero and one and change as the network learns from its mistakes.
This can be compared to the dendrites of a real neuron, which receive signals from other neurons
in the body (Martini-Hutchings et al., 2018). The way new values for weights are determined is
further explained later in this paper.

The activation function in the simulated neuron takes the sum of all the weighted inputs and
governs the thresholds at which the neuron will output certain values. This can be compared to how
an action potential works in a real neuron. As with an activation function, a threshold must be met before
a signal is sent through the neuron. In a real neuron this threshold is reached when the graded potential
at the axon hillock is sufficiently large. Only then will the signal pass through the axon and the
synapses to be processed by the recipient of the signal (Martini-Hutchings et al., 2018). In all of
the networks, the activation functions used are ReLU and softmax. The reason for using these
specific functions is explained in the next section. The output of the activation function in the
simulated neuron can be compared to the signal leaving the axon terminal through the synapse
(Martini-Hutchings et al., 2018).

Neural Networks

Figure 3: a basic neural network, with an input layer at the bottom, a hidden layer above it, and an output at the top.
This is a basic representation of a machine learning model called the neural network. It can be
compared to the nervous system in our bodies. The lowest layer represents the inputs.
These are not neurons, as all they do is feed the network with data; they can be compared to the
sensory input of our bodies. In this simple model there are only two inputs, but the dataset used
in this project has thirty-two inputs.

Hidden Layers, ReLU, and Softmax


The layer of neurons above the input is called the hidden layer because these neurons are not directly
exposed to the raw numbers from the input (Heaton, 2017). They are exposed only to weighted
numbers, and they output only weighted numbers. The more layers a network has, the "deeper" it is
considered to be; adding more layers also makes the program run more slowly (Heaton, 2017).
There were three hidden layers in all the networks used in this project. The first two layers used
ReLU as their activation function and the last layer used softmax as its activation function. The
function of all the hidden layers can be compared to how our central nervous system processes and
integrates the sensory input presented to it.

ReLU(x) = max(0, x)

Above is the ReLU function. ReLU is used because it is known to train networks faster and more
efficiently than other activation functions (Sharma, 2017). As the function shows, every input value
that is less than or equal to zero is output as 0. This leads to roughly half of all activations not
firing because their output is 0. This may seem like a major downside, but it makes the network
"lighter," allowing it to run faster and more efficiently (Sharma, 2017). In certain networks the
ReLU function can cause problems by being too "light," but further testing and validation showed
that the networks used in this project were not affected by such problems (Sharma, 2017).

Softmax is used as the activation function in the final layer because it maps the network's outputs
to a probability for each of the possible classifications (Lan, 2017). The integration done by these
functions can be thought of as the way the weights between neurons constantly change as the
network trains, just as the ideas in our brains change as we are presented with more sensory
information. This causes the neurons that make stronger associations to have higher weights.
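
To make these two functions concrete, here is a minimal NumPy sketch of ReLU and softmax. This is illustrative code only; the project's networks use TensorFlow's built-in tf.nn.relu and a softmax cross-entropy function rather than hand-written versions.

import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element by element
    return np.maximum(0, x)

def softmax(x):
    # Shift by the maximum value for numerical stability, then normalize
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(softmax(logits))  # probabilities for each class, summing to 1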

Output
The output layer is simply the output of the network. In the networks used for this project, the output
is the diagnosis of the breast tissue sample. This can be compared to the motor output from our
central nervous system.

Supervised Learning
The type of machine learning that the networks in this project use is called supervised learning.
Supervised learning involves using a dataset of training examples with correct labels (Maini,
2017). There are two types of supervised learning: regression and classification. Regression
predicts a continuously changing number (Maini, 2017). An example of this would be stock prices,
which are constantly changing and consist of a single numerical value. Classification (the type of
supervised learning that the networks in this project use) assigns a label to something
(Maini, 2017). An example of this would be looking at a picture of a tree and determining which
species it is.

Training
Training is how the network adapts to show signs of learning. During each step of training, a small
amount of data is exposed to the network. This data is run through all the neurons in the network
to produce an output, which is then compared to the expected output to calculate an error. The error
is minimized using an algorithm called gradient descent, a common algorithm when working with neural
networks (Mahanta, 2017). The goal of gradient descent is to minimize the error, or bring it below a
certain threshold, by updating the network's weights or coefficients based on the mistakes made in each
iteration (Mahanta, 2017). The error is propagated backward through the network to update the weights
between neurons; this process is called backpropagation (McGonagle, n.d.). The dataset used consisted
of 569 samples. Ten percent of it (57 samples) was set aside to test the accuracy of each network,
and the rest was used for training. These samples must be set aside because if the networks were
tested on samples that they already trained on, they would always get one hundred percent of them
right, as they have already seen them before. Specifically, ten percent were set aside because it is
a large enough sample to validate the networks, but it also does not take too much away from the
already small dataset.
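
As an illustration of the ninety/ten split and the weight update described above, here is a small NumPy sketch. The split sizes match the paper (57 test samples out of 569), but the placeholder data and the single-weight gradient descent step are generic examples, not the project's actual TensorFlow training code.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in dataset: 569 samples with 32 features each (placeholder values)
data = rng.normal(size=(569, 32))
labels = rng.integers(0, 2, size=569)

# Shuffle, then hold out ten percent (57 samples) for testing
order = rng.permutation(len(data))
test_idx, train_idx = order[:57], order[57:]
train_X, train_y = data[train_idx], labels[train_idx]
test_X, test_y = data[test_idx], labels[test_idx]

# One gradient descent step for a single weight w on a squared error:
# error = (w * x - y) ** 2, so d(error)/dw = 2 * (w * x - y) * x
w, x, y, learning_rate = 0.5, 2.0, 3.0, 0.1
gradient = 2 * (w * x - y) * x
w = w - learning_rate * gradient  # step against the gradient to reduce the error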

Loss
Loss is a measurement of how efficiently the network is learning. It is not a percentage like
accuracy, but rather a representation of how many errors are being made in each epoch, or iteration
(Nielsen, 2017). To calculate it, a cost function is used. The networks in this project use cross
entropy as their cost function, as it is well suited to classification problems such as tumor
diagnosis (Nielsen, 2017). In short, it is best for a network to reduce its loss as quickly as possible.
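
For reference, here is a minimal sketch of the cross-entropy cost for a single sample, written in plain NumPy; the project's code instead uses TensorFlow's tf.nn.softmax_cross_entropy_with_logits, which combines softmax and cross entropy in one call.

import numpy as np

def cross_entropy(predicted_probs, true_class):
    # The cost is the negative log of the probability assigned to the correct
    # class: near 0 for a confident correct prediction, large when wrong.
    return -np.log(predicted_probs[true_class])

probs = np.array([0.9, 0.1])    # e.g., 90% benign, 10% malignant
print(cross_entropy(probs, 0))  # correct class is benign  -> about 0.105
print(cross_entropy(probs, 1))  # correct class is malignant -> about 2.303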

Regularization
Adding regularizations to a neural network can increase its speed and efficiency. There are many
forms of regularization when working with neural networks. The two used in some of the networks
in this project were dropout and batch normalization. Dropout randomly ignores some of the neurons
in the hidden layers during each training step (Budhiraja, 2016). This is done to prevent overfitting.
Overfitting is when a network can accurately predict things about one specific dataset, but not a
broader dataset on the same subject (EliteDataScience, 2017). An example of this would be a
network that is meant to classify flowers. The network is trained on a small set of eight different
species of flowers and can accurately classify them. The same network is tested on a huge database
of flowers of the same species and fails horrifically. This is because the network was proficient at
classifying the specific flowers in the original database and did not become proficient at classifying
flowers overall. Ignoring a set of neurons can prevent this because it forces the network to learn
about more robust features.
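
A minimal sketch of what dropout does to one layer's activations during training is shown below. This is an illustrative NumPy version; the project's networks use tf.nn.dropout with a keep probability of 0.5, which behaves the same way.

import numpy as np

rng = np.random.default_rng(1)

def dropout(activations, keep_prob=0.5):
    # Zero out each activation with probability (1 - keep_prob), then scale
    # the survivors by 1 / keep_prob so the expected total stays the same.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

layer_output = np.array([0.8, 1.2, 0.0, 2.5, 0.3, 1.7])
print(dropout(layer_output))  # roughly half of the values are zeroed each call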

The second form of regularization is batch normalization. This normalizes the values passing through
each layer, within each batch of data, around a mean (Ioffe, 2015). It slows the network down slightly
but, in some cases, can increase the accuracy of the network at the same time (Ioffe, 2015). An example
of this would be a network comparing a relationship between two features of two separate species of
flowers; it would perform better if each value were normalized, as this would make the relationship a
comparison of variance in standard deviations rather than a comparison of two sets of numbers at
different orders of magnitude.
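
A minimal sketch of the normalization step itself, applied to one batch of values for one layer, is shown below; the project's networks use tf.layers.batch_normalization, which additionally learns a scale and shift for each feature.

import numpy as np

def batch_normalize(batch, epsilon=1e-5):
    # Normalize each feature (column) to zero mean and unit variance
    # across the samples in this batch.
    mean = batch.mean(axis=0)
    variance = batch.var(axis=0)
    return (batch - mean) / np.sqrt(variance + epsilon)

# Two features on very different scales end up on comparable scales
batch = np.array([[1000.0, 0.01],
                  [1200.0, 0.03],
                  [800.0, 0.02]])
print(batch_normalize(batch))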

MATERIALS

Computer
Python programming language, version 3
TensorFlow machine learning library
Database of breast cancer tissue screenings provided by the University of Wisconsin

PROGRAMMING
There were four networks used, each with three hidden layers. The first network had no
regularization put onto it. The second network used dropout. The third network used batch
normalization. The fourth network used dropout and batch normalization.

Here is the code that initializes each network. The lines that build the feedforward layers, and the
extra terms added to the cost in the regularized networks, are what differentiate the four types of networks.

First Network (No regularization)


import tensorflow as tf  # TensorFlow 1.x API


class first_network: # No regularization
    def __init__(self, learning_rate, x_shape, y_shape):
        # Placeholders for the input features and the one-hot labels
        self.X = tf.placeholder("float", [None, x_shape])
        self.Y = tf.placeholder("float", [None, y_shape])

        # Weight matrices for the three hidden layers and the output layer
        hidden1 = tf.Variable(tf.random_normal([x_shape, 512]))
        hidden2 = tf.Variable(tf.random_normal([512, 256]))
        hidden3 = tf.Variable(tf.random_normal([256, 128]))
        output = tf.Variable(tf.random_normal([128, y_shape]))

        # Bias vectors for each layer
        hidden_bias1 = tf.Variable(tf.random_normal([512], stddev=0.1))
        hidden_bias2 = tf.Variable(tf.random_normal([256], stddev=0.1))
        hidden_bias3 = tf.Variable(tf.random_normal([128], stddev=0.1))
        output_bias = tf.Variable(tf.random_normal([y_shape], stddev=0.1))

        # Three hidden layers, each using ReLU as its activation function
        feedforward1 = tf.nn.relu(tf.matmul(self.X, hidden1) + hidden_bias1)
        feedforward2 = tf.nn.relu(tf.matmul(feedforward1, hidden2) + hidden_bias2)
        feedforward3 = tf.nn.relu(tf.matmul(feedforward2, hidden3) + hidden_bias3)

        self.logits = tf.matmul(feedforward3, output) + output_bias

        # Softmax cross-entropy cost, minimized with the Adam optimizer
        self.cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits))
        self.optimizer = tf.train.AdamOptimizer(
            learning_rate=learning_rate).minimize(self.cost)

        # Fraction of samples whose predicted class matches the true label
        correct_pred = tf.equal(tf.argmax(self.logits, 1), tf.argmax(self.Y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Second Network (Dropout)


class second_network: # Dropout
    def __init__(self, learning_rate, x_shape, y_shape, beta=0.00005):
        self.X = tf.placeholder("float", [None, x_shape])
        self.Y = tf.placeholder("float", [None, y_shape])

        hidden1 = tf.Variable(tf.random_normal([x_shape, 512]))
        hidden2 = tf.Variable(tf.random_normal([512, 256]))
        hidden3 = tf.Variable(tf.random_normal([256, 128]))
        output = tf.Variable(tf.random_normal([128, y_shape]))

        hidden_bias1 = tf.Variable(tf.random_normal([512], stddev=0.1))
        hidden_bias2 = tf.Variable(tf.random_normal([256], stddev=0.1))
        hidden_bias3 = tf.Variable(tf.random_normal([128], stddev=0.1))
        output_bias = tf.Variable(tf.random_normal([y_shape], stddev=0.1))

        # Each ReLU layer is wrapped in dropout with a keep probability of 0.5
        feedforward1 = tf.nn.dropout(tf.nn.relu(tf.matmul(self.X, hidden1) + hidden_bias1), 0.5)
        feedforward2 = tf.nn.dropout(tf.nn.relu(tf.matmul(feedforward1, hidden2) + hidden_bias2), 0.5)
        feedforward3 = tf.nn.dropout(tf.nn.relu(tf.matmul(feedforward2, hidden3) + hidden_bias3), 0.5)

        self.logits = tf.matmul(feedforward3, output) + output_bias

        # Cross-entropy cost plus a small L2 weight penalty scaled by beta
        self.cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits))
        self.cost += (tf.nn.l2_loss(hidden1) * beta + tf.nn.l2_loss(hidden2) * beta +
                      tf.nn.l2_loss(hidden3) * beta + tf.nn.l2_loss(output) * beta)
        self.optimizer = tf.train.AdamOptimizer(
            learning_rate=learning_rate).minimize(self.cost)

        correct_pred = tf.equal(tf.argmax(self.logits, 1), tf.argmax(self.Y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Third Network (Batch normalization)


class third_network: # Batch normalization
    def __init__(self, learning_rate, x_shape, y_shape, beta=0.00005):
        self.X = tf.placeholder("float", [None, x_shape])
        self.Y = tf.placeholder("float", [None, y_shape])

        hidden1 = tf.Variable(tf.random_normal([x_shape, 512]))
        hidden2 = tf.Variable(tf.random_normal([512, 256]))
        hidden3 = tf.Variable(tf.random_normal([256, 128]))
        output = tf.Variable(tf.random_normal([128, y_shape]))

        hidden_bias1 = tf.Variable(tf.random_normal([512], stddev=0.1))
        hidden_bias2 = tf.Variable(tf.random_normal([256], stddev=0.1))
        hidden_bias3 = tf.Variable(tf.random_normal([128], stddev=0.1))
        output_bias = tf.Variable(tf.random_normal([y_shape], stddev=0.1))

        # The output of each ReLU layer is batch normalized
        feedforward1 = tf.nn.relu(tf.matmul(self.X, hidden1) + hidden_bias1)
        feedforward1 = tf.layers.batch_normalization(feedforward1)
        feedforward2 = tf.nn.relu(tf.matmul(feedforward1, hidden2) + hidden_bias2)
        feedforward2 = tf.layers.batch_normalization(feedforward2)
        feedforward3 = tf.nn.relu(tf.matmul(feedforward2, hidden3) + hidden_bias3)
        feedforward3 = tf.layers.batch_normalization(feedforward3)

        self.logits = tf.matmul(feedforward3, output) + output_bias

        # Cross-entropy cost plus a small L2 weight penalty scaled by beta
        self.cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits))
        self.cost += (tf.nn.l2_loss(hidden1) * beta + tf.nn.l2_loss(hidden2) * beta +
                      tf.nn.l2_loss(hidden3) * beta + tf.nn.l2_loss(output) * beta)
        self.optimizer = tf.train.AdamOptimizer(
            learning_rate=learning_rate).minimize(self.cost)

        correct_pred = tf.equal(tf.argmax(self.logits, 1), tf.argmax(self.Y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Fourth Network (Dropout and batch normalization)


class fourth_network: # Dropout and batch normalization
    def __init__(self, learning_rate, x_shape, y_shape, beta=0.00005):
        self.X = tf.placeholder("float", [None, x_shape])
        self.Y = tf.placeholder("float", [None, y_shape])

        hidden1 = tf.Variable(tf.random_normal([x_shape, 512]))
        hidden2 = tf.Variable(tf.random_normal([512, 256]))
        hidden3 = tf.Variable(tf.random_normal([256, 128]))
        output = tf.Variable(tf.random_normal([128, y_shape]))

        hidden_bias1 = tf.Variable(tf.random_normal([512], stddev=0.1))
        hidden_bias2 = tf.Variable(tf.random_normal([256], stddev=0.1))
        hidden_bias3 = tf.Variable(tf.random_normal([128], stddev=0.1))
        output_bias = tf.Variable(tf.random_normal([y_shape], stddev=0.1))

        # Each ReLU layer is batch normalized and then passed through dropout
        feedforward1 = tf.nn.relu(tf.matmul(self.X, hidden1) + hidden_bias1)
        feedforward1 = tf.nn.dropout(tf.layers.batch_normalization(feedforward1), 0.5)
        feedforward2 = tf.nn.relu(tf.matmul(feedforward1, hidden2) + hidden_bias2)
        feedforward2 = tf.nn.dropout(tf.layers.batch_normalization(feedforward2), 0.5)
        feedforward3 = tf.nn.relu(tf.matmul(feedforward2, hidden3) + hidden_bias3)
        feedforward3 = tf.nn.dropout(tf.layers.batch_normalization(feedforward3), 0.5)

        self.logits = tf.matmul(feedforward3, output) + output_bias

        # Cross-entropy cost plus a small L2 weight penalty scaled by beta
        self.cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits))
        self.cost += (tf.nn.l2_loss(hidden1) * beta + tf.nn.l2_loss(hidden2) * beta +
                      tf.nn.l2_loss(hidden3) * beta + tf.nn.l2_loss(output) * beta)
        self.optimizer = tf.train.AdamOptimizer(
            learning_rate=learning_rate).minimize(self.cost)

        correct_pred = tf.equal(tf.argmax(self.logits, 1), tf.argmax(self.Y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
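
The training loop itself is not reproduced in this paper. Under TensorFlow 1.x, training any of the classes above would look roughly like the sketch below; the placeholder data, learning rate, batch size, and epoch count are illustrative assumptions, not the project's actual values.

import numpy as np
import tensorflow as tf

# Placeholder data purely for illustration: 569 samples, 32 features, 2 classes
rng = np.random.default_rng(0)
features = rng.normal(size=(569, 32)).astype(np.float32)
labels = np.eye(2)[rng.integers(0, 2, size=569)].astype(np.float32)
train_X, test_X = features[57:], features[:57]
train_Y, test_Y = labels[57:], labels[:57]

model = third_network(learning_rate=0.001, x_shape=32, y_shape=2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        for start in range(0, len(train_X), 32):
            batch_x = train_X[start:start + 32]
            batch_y = train_Y[start:start + 32]
            _, loss = sess.run([model.optimizer, model.cost],
                               feed_dict={model.X: batch_x, model.Y: batch_y})
        accuracy = sess.run(model.accuracy,
                            feed_dict={model.X: test_X, model.Y: test_Y})
        print("epoch", epoch + 1, "loss", loss, "test accuracy", accuracy)
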
RESULTS
Below are charts displaying the accuracy and the loss of each of the networks over
each epoch, or iteration, of the program. The final accuracy for the network without any
regularizations was 92% (471 of 512 samples). The final accuracy of the network that used dropout
as a regularization technique was 56% (287 of 512 samples). The final accuracy of the network
that used batch normalization as a regularization technique was 93% (476 of 512 samples). The
final accuracy of the network that used both dropout and batch normalization was 57% (292 of
512 samples). The network that used no regularizations had a loss value of 125 at the start and
ended with a loss value of 3. The network that used dropout had a loss value of 1,672 at the start
and ended with a loss value of 970. The network that used batch normalization had a loss value of
86 and ended with a loss value of 10. The network that used both dropout and batch normalization
started with a loss value of 1,876 and ended with a loss value of 983.

Accuracy of Each Network: chart of accuracy (0% to 100%) over epochs 1 through 10 for the four networks (No Regularization, Dropout, Batch Normalization, Dropout and Batch Normalization).
Loss for Each Network: four charts of loss over epochs 1 through 10, one for each network: No Regularization (First Network), Batch Normalization (Third Network), Dropout (Second Network), and Dropout and Batch Normalization (Fourth Network).
CONCLUSION
This project achieved its goal of comparing the four different networks to see which was the most
accurate. The network with the highest accuracy used only batch normalization as its form of
regularization; compared with the network without any added regularization, the improvement was
not statistically significant under an unpaired t-test. The network with the lowest accuracy used
only dropout as a regularization method. Adding batch normalization to the network using dropout
did increase the accuracy, but not by a statistically significant amount under an unpaired t-test.
The reason dropout hurt the accuracy of the network so heavily is that losing some of the connections
while learning drastically reduced the rate at which learning could occur. Also, because of the small
size of the dataset (569 samples), there was not enough data to make up for the lost connections.
A similar pattern appears in the loss values, as the networks that used dropout started with a high
loss and ended with a high loss. In stark contrast, the networks that did not use dropout reduced
their loss quickly. The network that used only batch normalization did reduce its loss more slowly
than the network with no regularizations, but that is also most likely because of the small dataset.
With a much larger dataset, the loss curves of both networks would most likely look very similar.
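
The unpaired t-tests mentioned above are not shown in the paper; a comparison of that kind could be carried out with SciPy roughly as follows (the per-epoch accuracy lists here are made-up placeholders, not the project's measured values).

from scipy import stats

# Placeholder per-epoch accuracies for two networks (illustrative numbers only)
no_regularization = [0.70, 0.81, 0.86, 0.88, 0.90, 0.91, 0.91, 0.92, 0.92, 0.92]
batch_norm_only = [0.65, 0.78, 0.85, 0.89, 0.91, 0.92, 0.92, 0.93, 0.93, 0.93]

# Unpaired (independent two-sample) t-test
t_statistic, p_value = stats.ttest_ind(no_regularization, batch_norm_only)
print(t_statistic, p_value)  # a p-value above 0.05 suggests no significant difference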

DISCUSSION
Three major observations were made during this project. The first observation was that the
networks in this project could be greatly helped with a larger dataset for breast cancer research.
This would allow the networks that were trained using dropout to be truly validated. This is
because, theoretically, the networks trained using dropout should do a better job when diagnosing
a larger, broader database, especially if they continue to be trained as they diagnose. This is because networks
that use dropout have a broader understanding of the available data, as they lose many of the
specific associations they make. Networks like any of the ones in this project will only get better
as they are presented with more data.

The second observation was that a good way to validate the results of this program would be to
conduct an experiment where both the doctors and the program have to diagnose the same samples
of breast cancer tissue. This experiment would allow human error and the program’s error to be
compared.

One way to get some insight on this comparison without conducting the described experiment is
by looking at how accurately doctors diagnose breast cancer on their own. A study titled
“Diagnostic Concordance Among Pathologists Interpreting Breast Biopsy Specimens” was done
to investigate how accurately pathologists judge biopsy results. This study was published in the
Journal of the American Medical Association on March 17, 2015 (Elmore, Longton, & Carney,
2015). It involved 115 pathologists and 240 biopsy samples of breast tissue. Each pathologist was
presented with samples from this set and instructed to give a diagnosis for each one. In the end, the
pathologists correctly diagnosed the samples only 75% of the time. This is not great compared
to the program, but a doctor is trained to do many more things compared to the program, such as
being able to treat the cancer and further complications that may arise.

The third observation made was that integrating this program into preexisting diagnosis software
could greatly help doctors. A program like this could flag potentially tricky cases where it is
harder for doctors to come to a conclusive diagnosis. It would also let doctors quickly get a
second opinion about a case. By no means would this replace the doctor's diagnosis; rather, it
would be a tool that doctors could use to improve the accuracy of their diagnoses and the
efficiency with which a case is diagnosed.

Machine learning is another step forward in the general automation of society. It allows even the
most complex tasks to be completed without any human interaction. Whether this drive for
automation is good for society or not is a moral and ethical question. From a technical standpoint,
complete automation of entire industries is already happening. Whether society is ready for such
a drastic change is something to be observed in the coming years.
References

Budhiraja, A. (2016, December 15). Dropout in (deep) machine learning. Retrieved from https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5

The Editors of Encyclopedia Britannica. (2016). Associative learning. In Encyclopedia Britannica. Retrieved from https://www.britannica.com/topic/associative-learning

EliteDataScience. (2017, September 7). Overfitting in machine learning: What it is and how to prevent it. Retrieved from https://elitedatascience.com/overfitting-in-machine-learning

Elmore, J. G., Longton, G. M., & Carney, P. A. (2015, March 17). Diagnostic concordance among pathologists interpreting breast biopsy specimens. Retrieved from https://jamanetwork.com/journals/jama/fullarticle/2203798

Heaton, J. (2017, June 1). The number of hidden layers. Retrieved from http://www.heatonresearch.com/2017/06/01/hidden-layers.html

Ioffe, S., & Szegedy, C. (2015, March 2). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Retrieved from https://arxiv.org/abs/1502.03167

Lan, H. (2017, November 13). The softmax function, neural net outputs as probabilities, and ensemble classifiers. Retrieved from https://towardsdatascience.com/the-softmax-function-neural-net-outputs-as-probabilities-and-ensemble-classifiers-9bd94d75932

Mahanta, J. (2017). Keep it simple! How to understand gradient descent algorithm. Retrieved from https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html

Maini, V. (2017, August 19). Machine learning for humans, part 2.1: Supervised learning. Retrieved from https://medium.com/machine-learning-for-humans/supervised-learning-740383a2feab

Martini, F., Nath, J. L., Bartholomew, E. F., Ober, W. C., Ober, C. E., Welch, K., & Hutchings, R. T. (2018). Fundamentals of anatomy & physiology.

McGonagle, J. (n.d.). Backpropagation. Brilliant Math & Science. Retrieved from https://brilliant.org/wiki/backpropagation/

Nielsen, M. (2017, December). Neural networks and deep learning. Retrieved from http://neuralnetworksanddeeplearning.com/chap3.html

Sharma V, A. (2017, March 30). Understanding activation functions in neural networks. Retrieved from https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0

University of Wisconsin. (1995, November 1). UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set. Retrieved from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
