I. INTRODUCTION
The United States stock market is nearly $20
trillion in total capitalization, and predicting its
next move is the crown jewel of Wall Street.
Various trading strategies have been developed
using quantitative algorithms to execute more profitable trades, and enormous resources are poured
into gaining even the slightest bit of competitive
information. With the prevalence of insider trading trials, it is clear what great lengths some investors will go to in order to gain an edge. There are many financial
features that could potentially correlate with the
US marketplace, such as stock market momentum,
commodity prices, foreign exchange rates, foreign
stock exchanges, and general public opinion. Using
these features, one can develop trading strategies
tailored towards day trading (intraday, high frequency), week trading, or position trading (long
term holding).
Given the vast variety of factors that influence
the stock market, powerful analytical tools are necessary to correctly decipher and predict the erratic
movement of stock prices. The methodology and algorithms of machine learning, recently recognized as holding enormous opportunities in this domain, have the potential to drastically improve upon the more traditional and commonly used methods of stock market prediction.
This paper aims to explore and evaluate several different methods for their effectiveness in predicting the movement of stock prices. The initial methods used to approach this problem apply the principles of momentum and correlation trading to construct basic training data from historical prices. This data is then fed naively into several popular machine learning algorithms to determine their effectiveness.
III. METHODS
A. Momentum Trading
To begin with, we trained an algorithm online using the past behavior of stocks as feature vectors. For example, if the stock price rose on a certain day, the corresponding feature would be 1, and if the price fell, the feature would be 0. The output variable is whether the following day's price went up or down.
Using this feature representation, we implemented and trained Nearest Neighbor, Decision Tree, Naive Bayes, and Random Forest classifiers using the standard libraries provided by scikit-learn. Using the last 10 days as history, the stock's behavior for the next day would be predicted. We ran the
algorithms for stock prices in the US since January
2011. Unfortunately, we were typically stuck at
around 49% to 52% accuracy rates using a wide
variety of different parameters.
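As a concrete illustration of this setup, the sketch below builds the binary up/down features over a 10-day window and compares the four classifiers. It uses a synthetic price series as a stand-in for the real data and is not our actual pipeline:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def updown_features(closes, k=10):
    """Binary up/down features over the past k days; label is the next day's move."""
    moves = (np.diff(closes) > 0).astype(int)   # 1 if the price rose that day, else 0
    X = np.array([moves[i:i + k] for i in range(len(moves) - k)])
    y = moves[k:]                               # movement on the following day
    return X, y

# Placeholder price series; in practice this would be daily closes since January 2011.
closes = np.cumsum(np.random.randn(500)) + 100
X, y = updown_features(closes, k=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

for clf in (KNeighborsClassifier(), DecisionTreeClassifier(),
            MultinomialNB(), RandomForestClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```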
At best, the Multinomial Naive Bayes algorithm achieved 54% accuracy on the Dow Jones Industrial Average. This performance is hardly better than random prediction (and is even worse in certain cases), so clearly stronger features are necessary for these algorithms to achieve higher performance.
B. Correlation Trading
In [4], the researchers use a prediction algorithm
that exploits the temporal correlations of global
stock markets and other financial data to predict
the behavior of US stocks on the following day. Because no financial market in the world is isolated, all of this data should provide valuable predictive signal.
Normalizing the raw data increases both the efficiency and effectiveness of the neural network learning algorithms. Another component of the data pre-processing step involved identifying and filling in missing portions of the data; this was again accomplished using the Quandl API.
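The exact normalization scheme is not spelled out above; a common choice is to forward-fill gaps and then standardize each feature. The sketch below assumes a pandas DataFrame of daily values (one column per market or feature) and is only one plausible version of this step:

```python
import pandas as pd

def preprocess(prices: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values and normalize each column.

    `prices` is assumed to hold one column per market/feature and one row per day,
    e.g. as retrieved via the Quandl API. The z-score normalization here is one
    common option; the paper does not specify which scheme was used.
    """
    filled = prices.ffill().bfill()                  # fill gaps from neighboring days
    return (filled - filled.mean()) / filled.std()   # per-column z-score
```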
After the data pre-processing step was completed, the training feature vectors had to be constructed. This is achieved by looking at a moving frame of length k in the data. Thus, a feature vector consists of the stock price data of the past k days, and the output variable is the stock price movement (increase or decrease) on day k+1. Unlike in our previous algorithms, where the only data considered from previous days was whether the price went up or down, this feature representation incorporates, for each of the past k days, the opening price, high price, low price, and closing price of that day in order to increase precision. Other feature vector representations were also tested and are elaborated on in further detail below.
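A minimal sketch of this windowed representation follows. The column names and the labeling convention (close-over-close movement) are illustrative assumptions, not taken verbatim from the paper:

```python
import numpy as np
import pandas as pd

def make_window_features(ohlc: pd.DataFrame, k: int):
    """Build flattened feature vectors from a moving frame of k days of OHLC data.

    `ohlc` is assumed to have columns ['Open', 'High', 'Low', 'Close'], one row per
    trading day. The label is 1 if the close on day k+1 is above day k's close, else 0.
    """
    values = ohlc[["Open", "High", "Low", "Close"]].to_numpy()
    X, y = [], []
    for i in range(len(values) - k):
        window = values[i:i + k]                              # k days x 4 features
        X.append(window.flatten())                            # k*4 components per vector
        y.append(int(values[i + k, 3] > values[i + k - 1, 3]))
    return np.array(X), np.array(y)
```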
To achieve good accuracy, many parameters of the neural network itself, the choice of which features to use in the first place, and the feature vector normalization and representation all had to be considered. The numerous factors involved in this hyper-parameter optimization, feature selection, and feature engineering are discussed below.
Fig. 1. The set of features used versus the accuracy gained using these features. The numbers correspond to the following features: 0: Opening Price, 1: High Price, 2: Low Price, 3: Closing Price.
1) Feature Selection: The results demonstrate that using all the features performed very effectively (around 64% accuracy, which is significantly better than random guessing given the test set size of 45 predicted days), though ignoring the opening price and focusing on the high price, low price, and closing price performed similarly well. Ultimately, in our final classifier, we chose to use all the available features, as this combination performed the best.
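The subset search behind Fig. 1 amounts to a loop over feature combinations. In the sketch below, `train_and_score` is a hypothetical helper (not from the paper) that builds the windowed vectors from the chosen columns, trains the network, and returns held-out accuracy:

```python
from itertools import combinations

FEATURES = ["Open", "High", "Low", "Close"]   # indices 0-3 as in Fig. 1

def evaluate_feature_subsets(train_and_score):
    """Try every non-empty subset of the OHLC features and rank them by test accuracy.

    `train_and_score(columns)` is a stand-in for the full train/evaluate cycle on the
    given feature columns; it should return a single accuracy figure.
    """
    results = {}
    for r in range(1, len(FEATURES) + 1):
        for subset in combinations(FEATURES, r):
            results[subset] = train_and_score(list(subset))
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```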
2) Feature Engineering: There were two feature vector representations that were tested for their efficacy. In the first representation, if we are looking at the past k days, then we have k * (number of features considered) components in the input feature vector. For example, the first component of the feature vector corresponds to the kth day's opening price, the second component corresponds to the kth day's closing price, and so on, until the opening and closing prices of the day k days before the current day have each been added to the feature vector. The second possible representation involves taking the average of the past k days' feature values instead of inputting every single one of them into the neural network. After extensive testing of both feature vector representations with different parameters, the neural networks trained on the second representation were found to consistently outperform those trained on the first.
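For contrast with the flattened window above, the averaged representation collapses the k days into a single vector with one component per feature. The sketch below again assumes an (n_days, 4) OHLC array and an illustrative close-over-close label:

```python
import numpy as np

def make_averaged_features(values: np.ndarray, k: int):
    """Average the past k days' OHLC values into a single 4-component feature vector.

    `values` is assumed to be an (n_days, 4) array of Open/High/Low/Close prices;
    the label is whether the close rises on the following day.
    """
    X, y = [], []
    for i in range(len(values) - k):
        X.append(values[i:i + k].mean(axis=0))        # one averaged value per feature
        y.append(int(values[i + k, 3] > values[i + k - 1, 3]))
    return np.array(X), np.array(y)
```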
3) Number of Previous Days to Train On (Granularity): Determining how many days k into the past we should consider is a challenging question based on principles of finance alone. Some analysts may argue that, because of the efficient market hypothesis, there is no relation between a previous day's behavior and the next's, and consequently no way to predict future trends. Other analysts may argue that the efficient market hypothesis does not hold in the existing financial environment and that past trend data can indeed reveal a lot about future trends. Instead of debating the current economic status of the world, we can determine how effective looking at the past k days is by simply examining the prediction results while varying k, the granularity.
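One plausible way to run this sweep is shown below. It reuses the `make_window_features` helper sketched earlier, and the MLP is only a stand-in for the paper's feed-forward network, whose exact topology is not specified here:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def sweep_granularity(ohlc, k_values=range(1, 31)):
    """Record held-out accuracy for each window length k."""
    scores = {}
    for k in k_values:
        X, y = make_window_features(ohlc, k)          # windowing helper from the earlier sketch
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)
        model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
        scores[k] = model.fit(X_tr, y_tr).score(X_te, y_te)
    return scores
```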
The results clearly show that looking at past data can indeed aid in predicting the next day's movement.
Fig. 4. Sentiment classification accuracy by class: roughly 66-67% for negative-sentiment tweets and 69-72% for positive-sentiment tweets across the classifiers compared, including a Classification Tree.
The training for stock price prediction was done on 9,000 randomly selected entries from the September dataset. We then tested the results on 4,000 of the remaining entries from the same dataset.
TABLE II
CONFUSION MATRIX FOR STOCK PRICE PREDICTION (columns: Predicted -ve, Predicted +ve)
TABLE III
COMMON PERFORMANCE MEASURES

Performance Measure    Percentage Value
Error Rate             25.62%
Accuracy               74.38%
Precision              76.11%
Recall                 78.06%
F-Score                77.07%
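For reference, these measures follow directly from the confusion matrix in Table II. The snippet below is a generic illustration using scikit-learn metrics on binary up/down labels, not the code used to produce the reported numbers:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def performance_measures(y_true, y_pred):
    """Error rate, accuracy, precision, recall, and F-score for binary up/down labels."""
    acc = accuracy_score(y_true, y_pred)
    return {
        "Error Rate": 1.0 - acc,
        "Accuracy": acc,
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F-Score": f1_score(y_true, y_pred),
    }
```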
V. FUTURE WORK
We obtained very significant success in predicting the next day's stock price based on a day's Twitter sentiment. Going forward, we would like to extend this analysis in several ways.
A. Improving Models
- To improve sentiment classification, a larger training dataset would be required. Using word-dependency based models of tweets instead of a bag-of-words representation also has the potential to increase sentiment classification accuracy. As an example of a more subtle model of a tweet, one can instead represent a feature vector as a parse tree, which can be fed into an SVM as a piece of non-vectorial data. A custom kernel defining similarity between parse trees can then be defined (see the sketch at the end of this section).
- Improving stock price prediction and generalizing our analysis by considering multiple months over multiple years.
- Considering non-standard neural network topologies, such as recurrent and convolutional networks, as potential models beyond the feed-forward methods used in this paper.
- Adding a neutral category for tweets as well as for buying decisions. Currently, even a mildly positive sentiment in a tweet leads to a buy decision, which may not be optimal in a real-world setting.
- Frictional costs could limit profit generation, especially as our analysis recommends daily buying and selling decisions, a much shorter horizon than the typical decision time for retail investors (about 6 months).
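As a rough sketch of the parse-tree idea mentioned above, an SVM can consume non-vectorial tree data through a precomputed kernel. The tree representation, the simple node-label kernel, and the toy data here are illustrative assumptions, not part of the paper's pipeline:

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def node_labels(tree):
    """Collect all node labels of a parse tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    labels = [tree[0]]
    for child in tree[1:]:
        labels.extend(node_labels(child))
    return labels

def tree_kernel(a, b):
    """A simple, valid kernel: dot product of node-label count vectors."""
    ca, cb = Counter(node_labels(a)), Counter(node_labels(b))
    return sum(ca[label] * cb[label] for label in ca)

def gram(trees_a, trees_b):
    """Gram matrix of pairwise kernel values between two lists of trees."""
    return np.array([[tree_kernel(a, b) for b in trees_b] for a in trees_a])

# Toy parse trees with sentiment labels (1 = positive, 0 = negative).
train_trees = [("S", ("NP", "stocks"), ("VP", "rallied")),
               ("S", ("NP", "market"), ("VP", "crashed"))]
train_labels = [1, 0]

clf = SVC(kernel="precomputed")
clf.fit(gram(train_trees, train_trees), train_labels)

test_trees = [("S", ("NP", "shares"), ("VP", "rallied"))]
print(clf.predict(gram(test_trees, train_trees)))
```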