
MSc Thesis

Stock Market Prediction with Machine Learning and NLP

A thesis submitted in fulfilment of the requirements for the degree

Abstract

Models that use natural language processing techniques have proved useful in
predicting stock market movements. Recent academic research has explored how social media
outlets have influenced financial market movements. Separately, the use of machine
learning techniques to predict stock market movements from financial data has
been discussed extensively over the last decade. Numerous studies have demonstrated that deep
neural networks, in particular recurrent neural networks and long short-term memory (LSTM)
networks, offer greater predictive power than traditional machine learning models. This
study combines these machine learning tools with natural language processing
techniques to determine whether the combination offers any advantage in stock
market prediction.

Contents

Declaration of Authorship

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

Abbreviations

1 Research Objectives

2 Literature Review
2.1 Natural Language Processing and Financial Markets Prediction
2.2 Machine Learning Methods and Financial Markets Prediction
2.3 Summary

3 Requirements Analysis
3.1 Overview of Research
3.2 Model Evaluation and Validation

4 Professional, Legal, Ethical, and Social Issues
4.1 Professional Issues
4.2 Legal, Ethical, and Social Issues

5 Project Plan
5.1 Deliverables and Timelines
5.2 Project Risk

Bibliography

List of Figures

1 Gantt Chart

List of Tables

1 Project Deliverables and Timelines
2 Project Risk Analysis

Abbreviations

DNN Deep Neural Networks
LSTM Long Short-Term Memory
NLP Natural Language Processing
PCA Principal Component Analysis
RNN Recurrent Neural Networks

Chapter 1

Research Objectives

The goal of this study is to find out whether combining NLP techniques with advanced machine
learning tools offers better results than using either methodology alone. Some researchers
have paid particular attention to social media outlets and employed different model
specifications, including collective social sentiments or moods, social sentiments regarding
specific topics, and sentiments regarding related and causal firms. Others have focused on
applying different machine learning models to financial data, and the majority find that deep
neural networks outperform traditional machine learning models. This study draws on both
approaches to build a hybrid classifier that could be used to recommend buying, holding, or
selling a particular stock over the next few trading days.

In particular, the aims of the project are as follows:

1) Feature reduction and feature engineering on both text and financial data. Consider
all possible risk indicators when dealing with financial data. Features that could be
very useful in predicting short-term financial market movements include the Sharpe ratio,
index beta, industry beta, Value at Risk, the worst and peak values over the last n days,
the short-term average to long-run average ratio, the variance over the last n days, and
the short-term variance to long-run variance ratio.
2) Experiment with combining NLP and deep neural networks to train a hybrid classifier
on a training set containing both text and financial data. From the same training set, use
the text features to train the NLP classifier and the financial features to train the DNN
separately.
3) Validate all classifiers on out-of-time samples (the test set).
4) Compare the accuracy results calculated on the test set to see whether a hybrid
model using both text and financial data outperforms a standalone model.
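
As a rough illustration of the risk indicators listed in aim 1, the sketch below computes a few of them from a list of daily closing prices. The function names, the window lengths, the zero risk-free rate, the historical-quantile definition of Value at Risk, and the reading of "worst" and "peak" as the lowest and highest daily return are all illustrative assumptions rather than choices made in this study.

```python
from statistics import mean, pstdev, pvariance


def daily_returns(prices):
    """Simple returns between consecutive closing prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its standard deviation (not annualised)."""
    excess = [r - risk_free for r in returns]
    sd = pstdev(excess)
    return mean(excess) / sd if sd > 0 else 0.0


def historical_var(returns, level=0.95):
    """Historical Value at Risk: the loss exceeded on roughly (1 - level) of days."""
    ordered = sorted(returns)
    index = int((1.0 - level) * len(ordered))
    return -ordered[index]


def risk_features(prices, short=5, long=20):
    """Build a feature dictionary from the most recent `long` + 1 prices."""
    if len(prices) < long + 1:
        raise ValueError("need at least long + 1 prices")
    returns = daily_returns(prices)
    short_r, long_r = returns[-short:], returns[-long:]
    long_var = pvariance(long_r)
    return {
        "sharpe": sharpe_ratio(long_r),
        "var_95": historical_var(long_r),
        "worst_n": min(short_r),   # assumed meaning: lowest daily return
        "peak_n": max(short_r),    # assumed meaning: highest daily return
        "avg_ratio": mean(prices[-short:]) / mean(prices[-long:]),
        "variance_n": pvariance(short_r),
        "var_ratio": pvariance(short_r) / long_var if long_var > 0 else 0.0,
    }
```

Index beta and industry beta are omitted because they additionally require a benchmark return series.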

Chapter 2

Literature Review

This chapter explores existing research into the available techniques for financial
market prediction in order to form the foundation for the methodological framework of
the current study. The author reviews literature on the use of natural language
processing in financial market movement prediction, as well as literature on the use of
machine learning methods for financial market prediction, in order to understand the
benefits and limitations of each technique. The scope of this chapter is limited to these
two categories and to research conducted since 2010, in order to allow for a suitable
comparative analysis of important academic works that are recent and relevant to the
current dynamic market environment.

This chapter offers the reader a comprehensive evaluation of the different methodological
frameworks used by various scholars following these two common ideologies for market
prediction. The author also discusses the findings of the scholarly works while considering the
impact of the methodological choices on the findings.

In order to allow for ease of comprehension, the chapter is divided into two sections. The first
section of the chapter focuses on academic research employing natural language processing
techniques to predict stock market movements on the basis of available news. The second
section of the chapter explores articles regarding the use of machine learning methods to
predict stock market movement on the basis of market data. The chapter concludes with a
discussion on the benefits and limitations of the prediction techniques as well as a reflection
on how the methodologies of the various scholarly works have influenced the methodology
of the current study.

2.1 Natural Language Processing and Financial Markets Prediction

The increasing use of social media networks to disseminate information rapidly, and to a
wide audience, has had significant implications for researchers across multiple domains. It is
therefore intuitive that the evolution of social media networks into a channel for
disseminating news will have implications for financial markets as well. Scholars have been studying how
news sources influence market movements for decades. However, since the early 2010s,
researchers have also begun to explore how social media has influenced financial market
movement. This section of the chapter offers a discussion on the recent academic
undertakings that have studied how financial market movement can be predicted using data
from news and social media outlets.

Zhang et al. (2011) explored whether the sentiments of Twitter users, as identified via
their posts, could act as a predictor for stock market indices such as the Dow Jones
Industrial Average, NASDAQ, and S&P500. The scholars argued that because Twitter had been a
reliable medium for predicting election trends for the 2009 German federal elections, via the
analysis of the number of tweets reflecting voter preferences (Tumasjan et al., 2010), it was
likely that Twitter activity could also be used to predict financial market movements. To test
this, Zhang et al. (2011) compared the sentiments or moods of Twitter users with the
movement of the indices. Essentially, the scholars tracked words and phrases to identify
investor sentiments such as fear and optimism and correlated them with the movement
of the indices. The scholars found that the extent to which Twitter users exhibited emotions
(on day x) could be used to predict the direction in which the stock market indices would
move (on day x+1). When people had heightened sentiments, irrespective of whether
the sentiments were positive or negative, there was an inverse reaction in the Dow Jones
Industrial Average, NASDAQ and S&P500 indices. The study was thus one of the first works
to identify the extent to which general sentiments within society influence investor
sentiments in the market.
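
The comparison of day x with day x+1 described above amounts to a lagged correlation. The following is a minimal sketch with hypothetical inputs (a daily emotion score and a daily index return series); it is not the procedure used by Zhang et al. (2011), only an illustration of the idea.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def lagged_correlation(emotion, index_returns, lag=1):
    """Correlate the emotion score on day x with the index return on day x + lag."""
    if lag < 1 or lag >= len(emotion):
        raise ValueError("lag must be between 1 and len(emotion) - 1")
    return pearson(emotion[:-lag], index_returns[lag:])


# Hypothetical data: share of emotional tweets per day and daily index returns.
emotion = [0.10, 0.30, 0.20, 0.50, 0.40, 0.15]
index_returns = [0.004, 0.006, -0.002, 0.001, -0.007, -0.004]
print(round(lagged_correlation(emotion, index_returns), 3))
```

A negative value would correspond to the inverse reaction reported in the study: more emotion on one day, lower index returns on the next.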

Bollen et al. (2011) shared the sentiments of Zhang et al. (2011) regarding the possibility of
the collective mood on Twitter being able to predict the movement of the stock markets.
Bollen et al. (2011) questioned whether society could share collective sentiments, which
could ultimately influence how society behaves in particular situations. Thus, the scholars
questioned whether individual behavioural characteristics influencing decision-making could
also be visible in societies. The model developed by Bollen et al. (2011) is more complex and
nuanced than the model implemented by Zhang et al. (2011) in that it studies the collective
mood on Twitter using two tracking tools, as opposed to one. Furthermore, Bollen et al. (2011)
also classify the moods into six dimensions and evaluate which of these dimensions have
significance in predicting financial market movement. Thus, the model by Bollen et al. (2011)
offers a more comprehensive understanding of how social media information can be used to
predict financial market movement.

Li et al. (2014) apply sentiment analysis to financial news articles in order to determine
whether mapping the word patterns of the articles and identifying the way in which they
are framed would yield significant results for predicting financial market
movement. The scholars’ method is distinctive in that it tries to identify the direction of
financial market movement on the basis of the sentiments presented within the news
article. By including sentiment analysis in the prediction framework, the approach
improved on the ‘bag-of-words’ technique of past researchers because of its ability to
identify the intent of the news article, as opposed to relying solely on the usage of a word
or its synonyms. The scholars combined the Harvard psychological dictionary and the
Loughran-McDonald financial sentiment dictionary within the model, creating a complex
sentiment space that was capable of identifying the intent of the content of the articles. The
model by Li et al. (2014) shows a marked improvement over existing models in its ability to
predict the movement of daily stock prices on the Hong Kong Stock Exchange.
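
To make the contrast between a bag-of-words representation and a dictionary-based sentiment space concrete, the sketch below scores a headline against a few sentiment categories. The word lists are tiny hypothetical stand-ins, not entries taken from the Harvard or Loughran-McDonald dictionaries, and the scoring rule is an assumption for illustration rather than the method of Li et al. (2014).

```python
import re
from collections import Counter

# Tiny hypothetical word lists standing in for real sentiment dictionaries.
SENTIMENT_LEXICON = {
    "positive": {"gain", "growth", "profit", "strong", "upgrade"},
    "negative": {"loss", "decline", "weak", "downgrade", "lawsuit"},
    "uncertainty": {"may", "could", "risk", "uncertain"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Raw term counts: the representation that the sentiment space replaces."""
    return Counter(tokenize(text))


def sentiment_vector(text):
    """Share of tokens falling into each sentiment category."""
    tokens = tokenize(text)
    total = len(tokens) or 1
    return {
        category: sum(token in words for token in tokens) / total
        for category, words in SENTIMENT_LEXICON.items()
    }


def polarity(text):
    """Positive share minus negative share; the sign suggests a direction."""
    vector = sentiment_vector(text)
    return vector["positive"] - vector["negative"]
```

The bag-of-words output grows with the vocabulary, whereas the sentiment vector has one dimension per category, which is what allows two articles with different wording to be treated as expressing the same sentiment.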

Bhardwaj et al. (2015) questioned whether big data analytics and sentiment analysis could be
used to predict financial market movement for the Indian stock markets. The scholars
theorised that employing natural language processing techniques where users’ opinions,
sentiments, feelings, and evaluations were incorporated into the process of financial market
movement prediction was likely to improve accuracy, because of the technique’s ability to
extract intelligent information from seemingly unrelated data. The paper by Bhardwaj et al.
(2015) is crucial because it offers a comprehensive understanding of the various natural
language processing techniques available and evaluates their suitability for use for predicting
financial markets movement. The model proposed by the scholars is simple in its
implementation and relies on easily available resources such as the Python programming language
and the Ubuntu platform. A crucial limitation of the study is that the proposed model was not
compared with other sentiment analysis models that were employed by previous scholars.
Therefore, the study does not offer compelling evidence regarding the accuracy and efficiency
of the model but it forms the foundation for further research into how natural language
processing techniques can improve financial market predictions.

Nguyen et al. (2015) also developed a model for stock market movement predictions that
relies on sentiment analysis of information from social media networks. Although numerous
scholars had already developed models on the same premise, the
model developed by Nguyen et al. (2015) is distinctive because it only incorporates the
collective mood of users for specific topics. In this manner, the model differed from those of
Bollen et al. (2011), Zhang et al. (2011) and Li et al. (2014). By emphasising the sentiments for
specific topics, as opposed to the overall sentiment on the social media platform, Nguyen et
al. (2015) were able to implement the joint sentiment / topic model (JST) to facilitate
sentiment training via supervised machine learning. The model was also the first model to
explore sentiments and specific topics simultaneously to predict stock market movement.
However, in spite of the novelty of the model, the model is limited in its applicability because
it excludes macroeconomic and microeconomic factors that influence market movements.

Nayak et al. (2016) also developed a model for stock market movement prediction using
sentiment analysis and supervised machine learning algorithms. The model is an
improvement on the model by Nguyen et al. (2015) because it combines a historical price
analysis framework with the sentiment analysis technique, in order to account for the
macroeconomic and microeconomic factors affecting market movement. The scholars
developed a daily prediction model using supervised learning algorithms, as well as a monthly
prediction model that relies solely on historical data analysis. Thus, the scholars were able to
identify whether the impact of user sentiments can help predict market movement for longer
durations. However, the model does not acknowledge the topic-based sentiments that were
incorporated into the model by Nguyen et al. (2015). Experimental applications of the model
for the Indian stock market indicate that the daily prediction model offers 70% accuracy,
indicating that sentiment analysis is useful in improving the accuracy of historical data
analytical models of market movement. However, the monthly prediction model developed
by Nayak et al. (2016) offered less compelling results, indicating that monthly trends are
unlikely to be correlated highly with each other. The scholars argue that if the monthly
prediction model incorporates sentiment analysis, then its accuracy and correlation results
are likely to improve. However, the study fails to offer empirical evidence to support its
assumptions.

Li et al. (2017) extend the work of Bollen et al. (2011) and question whether public
sentiment on Twitter or other social media networks can help in predicting the stock price
movement of particular stocks. The study is distinctive because it develops a model
(SMe-DA-SA) that is efficient in collecting temporal data from social media networks. The
SMe-DA-SA model uses natural language processing techniques to classify tweets into five
different categories, improving upon the idea of Bollen et al. (2011) of classifying
sentiments into categories. The scholars also apply the theory of adjusted residuals in the
model to identify patterns between public sentiments and stock market prices. The model
also studies the social media data for the specific sample firms, irrespective of whether the
company is mentioned directly or indirectly. All these techniques, when applied collectively,
allow the model to identify market sentiments more accurately than models developed by prior
scholars. This is likely why the model offers an average accuracy of 70% for the 30 sample
firms that were evaluated. It is important to note that, irrespective of a model’s
sophistication or simplicity, none of the models discussed above offers more than 70%
accuracy. This is likely due to the exclusion or limited inclusion of
macroeconomic and microeconomic factors in the models.

Katayama and Tsuda (2018) implemented sentiment analysis on the news for companies
listed on the Japanese stock market. Unlike previous scholars who conducted sentiment
analysis on social media network data, Katayama and Tsuda (2018) evaluate large quantities
of news information to identify the characteristics of news articles that influence stock
price movement. Furthermore, the scholars’ work differs from existing research in that it
strives to evaluate the characteristics of the news articles and their impact on the stock price
movement of individual companies, irrespective of whether the relationship is positive or
negative. This is crucial because most previous scholars have studied the news articles for
their ability to predict the movement of the entire market or been successful in identifying a
relationship between market pessimism and downward movement of the market. Thus,
essentially, the study by Katayama and Tsuda (2018) is a more nuanced implementation of
sentiment analysis. The scholars rely on the polarity dictionary in order to identify the polarity
of the article or news data, which is a simplistic classification of the sentiments of the data.
The findings of the study indicate that, along with the sentiments within the news article, the
positioning of the article in the outlet and the volume of follow-up news articles also
influence market movement. Thus, the findings of the study also help in understanding how
public sentiment regarding the news article is generated, implying that the source and
frequency of the news data can influence sentiments more than the intrinsic intent of the
data.

Das et al. (2018) studied Twitter streaming data in order to predict stock market movements
for firms. Unlike prior models, which studied archival Twitter data, the model by Das et
al. (2018) relies on real-time, streaming data to determine the sentiments of users.
The scholars relied on the Apache Spark platform, which is known for its distributed
machine learning library. Thus, the study is one of the recent efforts to combine sentiment
analysis, largely a natural language processing technique, with the applicability and
scalability of machine learning tools. The scholars combine sentiment analysis with recurrent
neural networks (RNNs) because of the suitability of RNNs for training the model. The scholars’
research findings indicate that combining natural language processing techniques with
machine learning algorithms can help in improving the accuracy of prediction models.

2.2 Machine Learning Methods and Financial Markets Prediction

The use of algorithms, statistical models or computing techniques for the prediction of stock
market movements is not a recent development. Scholars have been implementing machine
learning techniques ranging from simplistic market simulations (Arthur et al., 1996) and
simplified neural networks (Zhang et al., 1998) to complex models as developed by Wang et
al. (2012). However, in the past few years, the emphasis on machine learning methods has
increased due to the availability and popularity of big data (Henrique et al., 2019). This section
of the chapter discusses the recent developments in the employment of machine learning
methods for the purpose of understanding and predicting the movement of financial markets,
on the basis of the information available from the markets themselves.

Goykhman and Teimouri (2018) constructed a market simulation based on the Hidden
Markov Model, employing recurrent neural networks as a means of reconstructing the
transition probability matrix for hidden sentiments from observed stock prices. The scholars
focused their efforts towards answering the question of whether observed stock prices can
be used to understand the underlying sentiment processes of the agents. This is a marked
departure from the existing viewpoint wherein the sentiment driven framework is used to
identify how the stock price dynamics would be affected by the underlying sentiment
processes. Thus, Goykhman and Teimouri (2018) flipped their perspective on the research
topic. The scholars conducted the study by implementing the assumption that the agents’
sentiment processes are emergent, i.e. the sentiment states are an accurate depiction of the
collective behaviour of the agents. The scholars also departed from existing research
in that they set aside the question of the level of intelligence of the agents. Thus, the scholars
were able to assume that all decision-making is completed before the
formulation of the driving sentiment processes. These are two specific ways in which the
study by Goykhman and Teimouri (2018) differed from previous studies.

The study by Goykhman and Teimouri (2018) was restricted in its scope to the exploration of
a simulated stock market environment and does not try to extend its application to the real-
world. The scholars also make use of two sentiment regimes – a simplistic one aligned with
the cash flow balance equation; and a more complex situation incorporating non-trivial
sentiment time series. The simple sentiment driven environment assumes that market
sentiments are likely to change twice during the course of the simulation and that agents
follow the buy/sell sentiments. The sophisticated situation incorporates regular switching
between various sentiments, with the sentiments following a non-trivial time series process,
using a Markov chain with a pre-set transition probability matrix. Thus, Goykhman and
Teimouri (2018) strive to recover the transition probability matrix using the Baum-Welch
algorithm of the Hidden Markov Model, via the observed stock market movement. The
scholars find that the application of the Viterbi algorithm did not yield significant results,
whereas the use of the recurrent neural network offered an accuracy of 50%, which is
significantly better than the 33% of the random score.
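
The idea of a sentiment process driven by a Markov chain with a pre-set transition probability matrix, and of recovering that matrix from data, can be sketched as follows. Three sentiment states are assumed here, which is consistent with the 33% random score quoted above but is not stated explicitly; the state labels and probabilities are hypothetical. Note also that the estimate below simply counts transitions in a directly observed state path. This is a much simpler stand-in for the Baum-Welch algorithm, which is needed when the states are hidden and only prices are observed.

```python
import random

STATES = ("buy", "hold", "sell")  # assumed labels for three sentiment states

# Hypothetical pre-set transition probabilities; each row sums to 1.
TRANSITIONS = {
    "buy": {"buy": 0.8, "hold": 0.15, "sell": 0.05},
    "hold": {"buy": 0.2, "hold": 0.6, "sell": 0.2},
    "sell": {"buy": 0.05, "hold": 0.15, "sell": 0.8},
}


def simulate(steps, start="hold", seed=0):
    """Draw a sentiment path from the Markov chain."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        row = TRANSITIONS[path[-1]]
        path.append(rng.choices(STATES, weights=[row[s] for s in STATES])[0])
    return path


def estimate_transitions(path):
    """Estimate the matrix by counting observed state-to-state moves."""
    counts = {a: {b: 0 for b in STATES} for a in STATES}
    for current, following in zip(path, path[1:]):
        counts[current][following] += 1
    estimate = {}
    for a in STATES:
        total = sum(counts[a].values())
        estimate[a] = {b: (counts[a][b] / total if total else 0.0) for b in STATES}
    return estimate
```

With a long simulated path, the counted frequencies approach the pre-set probabilities; the harder problem studied by the scholars is doing the same when the path itself cannot be seen.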

Fischer and Krauss (2018) recommend the use of long short-term memory (LSTM) networks
to predict financial market movement and identify non-linear structures in financial market
data. The scholars compare results from LSTM networks with those of memory-free
techniques, including deep learning, random forests, gradient-boosted trees, and different
ensembles. The premise for the study was that because deep learning
techniques have improved speech recognition, object detection, etc. in other domains, the
techniques are likely to improve the accuracy of time series predictions for financial markets
as well. The scholars’ selection of LSTM networks aligns with recent developments in the field
of financial market prediction. Fischer and Krauss (2018) further differ in their methodological
choices from past scholars by applying LSTM networks to volume-weighted average
prices as opposed to closing prices. The scholars studied the entirety
of the S&P500 from 1992 to 2015 and found that LSTM networks yielded improved results
that were economically and statistically significant when compared with the results of random
forests, standard deep neural networks, and standard logistic regressions. These three
techniques are popular benchmarks in the existing literature, which makes the findings of the
study valuable to academia. LSTM networks are also a form of recurrent neural network,
the accuracy of which was found to be superior to random scores by Goykhman and
Teimouri (2018). Fischer and Krauss (2018) also succeeded in creating a robust empirical
framework using LSTM networks to facilitate the use of the technique for future time series
predictions where there is a significant volume of noise, including noisy financial time series
data.
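
The difference between an LSTM and the memory-free benchmarks lies in the cell state that is carried from one time step to the next. The sketch below implements the standard LSTM cell equations for a single unit with scalar input, purely for readability. The parameter layout and the helper names are assumptions for illustration; this is not the architecture or configuration used by Fischer and Krauss (2018).

```python
from math import exp, tanh


def sigmoid(x):
    """Logistic function used by the LSTM gates."""
    return 1.0 / (1.0 + exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell.

    `params` maps each of "f", "i", "o", "g" to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre("f"))   # how much of the old cell state to keep
    write = sigmoid(pre("i"))    # how much of the candidate to store
    output = sigmoid(pre("o"))   # how much of the cell state to expose
    candidate = tanh(pre("g"))   # proposed new content
    c_new = forget * c_prev + write * candidate
    h_new = output * tanh(c_new)
    return h_new, c_new


def run_sequence(returns, params):
    """Feed a return series through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for r in returns:
        h, c = lstm_step(r, h, c, params)
    return h
```

A memory-free model would map each input to an output independently, whereas here the result for the last observation depends on every earlier observation through `c` and `h`.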

Lachiheb and Gouider (2018) employed a hierarchical deep neural network (DNN) framework
to predict stock returns. The network was trained on 5-minute, high-frequency data covering
4 years, in order to predict how a stock would perform over the next 5 minutes within the
same time period. The scholars were of a similar inclination to Fischer and Krauss (2018) and
argued that because DNNs had improved the accuracy of image processing and text recognition
in other domains, they should also be able to improve financial market predictions. The study by
Lachiheb and Gouider (2018) builds upon the work of past scholars by extending prior DNN
models to include data from other stocks within the market, aside from the stocks that are
being studied. In this manner, the model improves upon the models of
Chen et al. (2017) and others, which were restricted to studying the stocks in
isolation from the market.

By incorporating the entire market and its players into the framework, and creating a
hierarchical DNN model, Lachiheb and Gouider (2018) succeeded in improving the accuracy
of their predictions on a simulation of the Tunisian stock market by 71% compared to previous
scholars implementing a DNN model. Even though the model’s results are statistically
significant, it is important to acknowledge that the model relies on 5 minutes of training data
to predict movements for the next 5 minutes for a sample of 45 stocks. This is a remarkably
short time period and sample, should one consider the scalability of the model. However, in
spite of the limited scope of the sample of the research model, the study makes a discernible
improvement to the design of the framework, by incorporating not only the stock’s own past
performance, but also the performance of other stocks. Thus, the model is more credible for
real-world applications than past studies employing DNN models for market prediction.

Zhang et al. (2018) also employ machine learning methods to predict stock price trends. The
scholars developed a stock price trend prediction system that uses big data and unsupervised
pattern recognition to generate training samples. Furthermore, the prediction system, named
‘Xuanwu’, can be transitioned to real-world application, while simultaneously integrating
supervised machine learning models. Thus, the model developed by Zhang et al. (2018) allows
analysts and researchers to transcend the limitation of relying on human selection and
labelling of data.

The study discusses explicitly the process of developing training samples without human
interaction, thereby offering a practical solution to the issue of the significant volume of
transactions in financial markets, on a daily basis. The model training tool of the prediction
system generates samples by recognizing patterns in the shape of the closing prices of stocks
for predetermined fixed trade durations and clustering them using the WEKA tool. Unlike
prior studies using morphological patterning that predicted the patterns or shapes arising due
to their (weak) interaction with the trend of price movement, the model by Zhang et al. (2018)
predicts the probability of the formation of the predefined shape, which is a stronger
interaction. Thus, the model yields superior accuracy to prior applications of morphological
patterning. The scholars’ model yields efficient results in generating unsupervised training
samples. The accuracy of the model also exceeds the accuracy of other models relying solely
on random forests, etc. because the prediction system reliably incorporates supervised
machine learning models into its operability.

Kim and Won (2018) developed a hybrid model incorporating LSTM and various GARCH-type
models to help predict stock market volatility in order to improve portfolio risk management
and hedging strategies. The scholars combined the LSTM model with up to three GARCH-type
models and a deep feed-forward neural network (DFN) to develop the hybrid model,
whose accuracy in predicting stock market volatility exceeded that of single technique models
significantly. By combining deep learning neural network models with GARCH-type models,
the authors were able to reduce the possibility of error in financial time-series models
dramatically. This is because the GARCH-type models help in capturing the clustering
tendency of volatility while the neural networks help in capturing non-linear relationships.
Therefore, by incorporating GARCH, EGARCH, and EWMA into the hybrid model, the scholars
were able to identify exactly which combination of GARCH-type models and neural network
models offered most accurate predictions of stock market volatility. Thus, the model is
capable of optimizing LSTM’s ability to learn long-range dependency to identify more
complicated patterns than other neural networks that are shallow. The study also identified
that in spite of combining three GARCH-type models with the LSTM model, a larger period of
forecasting increased the value of the error. However, in similar situations, the multiple
GARCH-type model, when combined with DFN, offered lower errors than the LSTM model.
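
The clustering tendency of volatility mentioned above can be seen in the textbook variance recursions behind two of the named models, EWMA and GARCH(1,1): a large return on one day raises the variance forecast for the following days. The sketch below uses these standard recursions with hypothetical, unfitted parameter values; it does not reproduce the estimation or the hybrid architecture of Kim and Won (2018).

```python
def ewma_variance(returns, lam=0.94, initial=None):
    """EWMA variance forecasts: s2[t] = lam * s2[t-1] + (1 - lam) * r[t-1]**2."""
    if not returns:
        return []
    s2 = returns[0] ** 2 if initial is None else initial
    forecasts = [s2]
    for r in returns[:-1]:
        s2 = lam * s2 + (1.0 - lam) * r ** 2
        forecasts.append(s2)
    return forecasts


def garch_variance(returns, omega=1e-6, alpha=0.08, beta=0.9, initial=None):
    """GARCH(1,1) variance: s2[t] = omega + alpha * r[t-1]**2 + beta * s2[t-1]."""
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for a finite long-run variance")
    if not returns:
        return []
    # Start from the long-run variance unless a starting value is supplied.
    s2 = omega / (1.0 - alpha - beta) if initial is None else initial
    forecasts = [s2]
    for r in returns[:-1]:
        s2 = omega + alpha * r ** 2 + beta * s2
        forecasts.append(s2)
    return forecasts
```

Each function returns one variance forecast per day, where the forecast for day t uses returns up to day t-1 only. How such series are passed to the neural network in the hybrid model is not detailed here, so the sketch stops at the variance series.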

Hiransha et al. (2018) evaluated and compared the accuracy of four types of deep learning
architectures for the stock exchanges of New York and India. The framework of the study is
distinctive in that it employs a training sample of one company from the National Stock
Exchange (NSE) of India to predict the stock price movement of five companies from both
the NYSE and the NSE. Each of the neural network models outperformed traditional linear models
of stock price prediction. However, the convolutional neural network performed better than
all other neural network models. The study was crucial in not only identifying which neural
network model offered improved predictions, but it also helped in determining that data from
one stock exchange could be used to train the neural network model for other stock
exchanges with similar characteristics. Thus, the study also succeeded in indicating that deep
learning, non-linear models are adaptable across markets.

Chatzis et al. (2018) built a forecasting tool to identify the probability of stock market crashes
using various machine learning algorithms. The scholars incorporate multiple machine
learning methods like deep learning tools, and boosting algorithms to forecast global financial
crises and offer an early warning system. This is a marked departure from the use of macro-
indicators that are biased and heuristically defined in the existing early warning systems. The
model explored financial data for almost 30 markets, for over 20 years in order to determine
whether financial crises also exhibit characteristics of clustering. By incorporating neural
networks, extreme gradient boosting, random forests, and support vector machines, amongst
other techniques, the model creates a complex ecosystem where the shortcomings of one
technique are offset by the benefits of the others. The model proposed by the scholars was
efficient in identifying significant market indicators that predict stock market tail events, and
employs machine learning techniques to estimate the probability of the occurrence of a
financial crisis. Thus, the model offers a compelling improvement over existing early warning
systems for financial crises.

Like Zhang et al. (2018) and Kim and Won (2018), Long et al. (2019) developed a
stock price prediction model using deep learning tools. However, while Zhang et al. (2018)
implemented a morphological pattern recognition system that could incorporate supervised
learning tools as well, and the model by Kim and Won (2018) combined LSTM and multiple
GARCH-type models, the model by Long et al. (2019), named the ‘multi-filters neural network’
(MFNN), employs a combination of convolutional and recurrent neuron structures to extract
features and predict price movements. Previous studies discussed above have already
found that recurrent neural networks and LSTM offer significantly high levels of accuracy.
However, the MFNN model offers even more accuracy than single structure networks,
indicating that employing combinations of structural networks is likely to help offset the
limitations of any individual structure. Thus, the model is more suited to accurately identifying
features of the market; correspondingly improving the credibility of the model. This is
markedly different from the traditional approach to feature identification, which is largely
based on scholars’ assumptions or derivations on the basis of the historical movement of the
stock. The model is also unique because the application of convolutional and recurrent
systems allows features to incorporate varying information and create a more sophisticated
and integrated extraction and prediction model than existing two-stage models.

Nam and Seong (2019) use multiple kernel learning to predict stock market movement for
sample firms. The scholars offer a novel way of incorporating asymmetric relationships
between sample firms and their related firms into the prediction model, thereby addressing
another layer of ambiguity that persists in existing models. By incorporating the causal
relationships between sample firms and related firms, the directional impact arising from
the industry or macroeconomic environment can be built into the prediction model. The model
was effective in predicting the direction of stock price movements for sample firms, even in
the absence of news pertaining directly to the firm, because news regarding the related or
causal firms was available. Thus, including the causal relationship in the prediction model
via specific machine learning algorithms increases the accuracy of prediction models that
rely on machine learning techniques.

Zhang et al. (2019) developed a new deep learning architecture based on a Generative
Adversarial Network (GAN), with a Multi-Layer Perceptron (MLP) as the discriminator and an
LSTM as the generator, to forecast closing stock prices. The scholars train the model using 7
factors, and the GAN framework trains the two networks against each other as a zero-sum game.
In this adversarial process, the generator learns to simulate real data, whereas the
discriminator strives to distinguish the real data from the simulated data. When the
discriminator can no longer tell real and simulated data apart, the generator has captured the
data distribution and can be used for prediction. The experimental results indicate that the
model predicts closing prices on real data more successfully than the other deep learning
techniques it was compared with.
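
The zero-sum game described for the GAN framework is commonly written as a minimax objective.
The formula below is the standard GAN formulation rather than one quoted from Zhang et al.
(2019); the symbols are assumptions introduced here for illustration: G is the generator (the
LSTM), D is the discriminator (the MLP), x is a real observation, and z is the generator's
input (random noise in the standard formulation; in a forecasting setting it would presumably
be the window of historical factors).

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximises V by assigning high probability to real data and low probability
to generated data, while the generator minimises V by producing outputs the discriminator
accepts as real. At equilibrium the discriminator cannot separate the two.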

The application of machine learning methods, and deep learning techniques in particular, by
various scholars in the recent past has been discussed extensively in the review paper by
Henrique et al. (2019). The scholars offer a comprehensive exploration of 57 recent studies
into financial market prediction using machine learning tools. They conclude that models
employing neural networks or support vector machines offer higher prediction accuracy than
other machine learning tools, with 70% of the recent studies employing a version of neural
networks. Thus, there is a clear preference for neural network structures in current academic
research.

2.3 Summary

Irrespective of whether researchers use natural language processing or machine learning tools
to predict stock market movements, there remain numerous possibilities for future work.
Neither sentiment analysis and natural language processing techniques nor deep learning tools
are capable of entirely eliminating the ambiguities involved in predicting financial market
movement (Henrique et al., 2019). However, models developed under each of the two ideologies
offer distinct benefits.

Models employing sentiment analysis have successfully allowed researchers to predict market
movements using collective social sentiments (Zhang et al., 2011), social sentiments
regarding specific topics (Nguyen et al., 2015), and sentiments regarding related and causal
firms (Nam and Seong, 2019), among others. As such, irrespective of the specifics of the
models, natural language processing techniques are capable of gleaning important information
regarding market movement from seemingly unrelated data sources (Nam and Seong, 2019). Thus,
incorporating more big data analysis and natural language processing techniques into models,
as a means of training machine learning models, is worth considering. The models reviewed in
the natural language processing section of this chapter also address the inability of
traditional financial market prediction models to incorporate behavioural factors (Li et al.,
2017). Sentiment analysis therefore appears to be a suitable proxy for the behavioural
component of financial market movements.

The discussion of machine learning methods presented above indicates that deep neural
networks such as the RNN and LSTM offer considerable benefits to the field of financial
market prediction (Henrique et al., 2019). The neural network models reviewed here report
improved accuracy over traditional market movement prediction models. Furthermore, the use
of multiple neural network models and unsupervised training suggests that a combined model is
more likely to predict market movement and market volatility with greater accuracy, because
the limitations of one model are offset by the other (Goykhman and Teimouri, 2018; Kim and
Won, 2018). Thus, incorporating machine learning techniques into the market prediction model
is likely to help in determining the level of movement within the market with higher accuracy
(Kim and Won, 2018).

In spite of the benefits of the various models discussed above, a common limitation is that
each of the models is an overly simplified representation of the financial markets and social
networks (Henrique et al., 2019). Although it is practically impossible to incorporate all of
the related factors and data into a predictive model, these models make several assumptions
and consciously exclude crucial factors (Arthur et al., 1996; Henrique et al., 2019). The
majority of these studies have explored and developed their models in isolation from the
macroeconomic and microeconomic factors affecting market movement and the movement of stock
prices for individual firms (Henrique et al., 2019). Arguably, these factors are likely to
have a significantly larger impact on market movement than the variables studied in these
models. Therefore, future scholars should extend these models by verifying that they can be
adapted to existing prediction models that incorporate macroeconomic and microeconomic
variables into the prediction process.

Chapter 3

Requirements Analysis

3.1 Overview of Research

The aim of this project is to build a hybrid model that combines two ideologies: NLP and Deep
Neural Networks. The model could be used to recommend whether to buy, hold or sell a
particular stock over the next few trading days. The key research question is: does the
hybrid model combining the two ideologies outperform the standalone model in stock market
prediction?
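
As an illustration of how a buy, hold or sell target over the next few trading days might be
constructed, the following minimal Python sketch labels each day from its forward return. The
thesis does not specify a labelling rule (and plans its own code in R), so the function name
`label_forward_returns`, the `horizon` of trading days and the `threshold` value are all
hypothetical choices made for this example only.

```python
from typing import List


def label_forward_returns(
    closes: List[float], horizon: int = 3, threshold: float = 0.01
) -> List[str]:
    """Label each day 'buy', 'hold' or 'sell' from the return over the next `horizon` days.

    Assumption (not from the thesis): a forward return above +threshold is a 'buy',
    below -threshold is a 'sell', and anything in between is a 'hold'.
    The last `horizon` days get no label because their forward return is unknown.
    """
    labels = []
    for t in range(len(closes) - horizon):
        forward_return = closes[t + horizon] / closes[t] - 1.0
        if forward_return > threshold:
            labels.append("buy")
        elif forward_return < -threshold:
            labels.append("sell")
        else:
            labels.append("hold")
    return labels


# Toy closing prices, invented for illustration.
closes = [100.0, 101.0, 103.0, 102.0, 99.0, 100.0]
labels = label_forward_returns(closes, horizon=2, threshold=0.02)
print(labels)
```

The threshold controls how many days fall into the hold class, so in practice it would need
to be chosen alongside the class-balancing step discussed in the evaluation criteria.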

3.2 Model Evaluation and Validation

In order to assess whether the hybrid model outperforms the standalone model, the following
points need to be taken into account:
1) Overall accuracy comparison on balanced datasets (i.e. total sells = total holds = total
buys) rather than on skewed datasets
2) Model stability evaluation (i.e. use of multiple time-window test sets to compare
validation results)
3) Confusion matrix comparison, which also considers true and false negatives
4) Comparison of validation results with previous work, for example to check whether a
different feature engineering approach greatly improved model performance
5) When comparing with previous work, focus on the evaluation approach rather than on
accuracy % (training and test sets are likely to be different, and previous work may not
have been trained on a balanced dataset)
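
The first three evaluation points can be sketched in a few lines of Python. This is a minimal
illustration using only the standard library, not the project's implementation (which is
planned in R). All function names, the downsampling strategy used for balancing, and the
window sizes are assumptions made for this example.

```python
import random
from collections import defaultdict

LABELS = ("sell", "hold", "buy")


def balance_by_downsampling(samples, seed=0):
    """Point 1: downsample every class to the size of the smallest class.

    `samples` is a list of (features, label) pairs. Downsampling is one possible
    balancing strategy; the thesis does not say which one will be used.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced


def rolling_test_windows(n_samples, train_size, test_size):
    """Point 2: yield (train_range, test_range) index ranges that move forward in time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        split = start + train_size
        yield range(start, split), range(split, split + test_size)
        start += test_size


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Point 3: count predictions as matrix[true_label][predicted_label]."""
    matrix = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return correct / total if total else 0.0


# Toy example with invented labels and predictions.
y_true = ["buy", "buy", "hold", "sell"]
y_pred = ["buy", "hold", "hold", "buy"]
cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm))
print([(list(tr), list(te)) for tr, te in rolling_test_windows(10, 4, 2)])
```

Keeping the test window strictly after the training window avoids look-ahead bias, and
comparing the per-window confusion matrices of the hybrid and standalone models gives a view
of stability that a single accuracy figure would hide.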

When building either the hybrid or the standalone model, multiple experiments will be
performed. This may include testing with different model assumptions and model parameters
or even modelling with principal components rather than engineered features. As such, there
should be at least 3 different iterations in the model development stage. This approach will
allow evaluation of experiments at the end of each iteration.

Chapter 4

Professional, Legal, Ethical, and Social Issues

4.1 Professional Issues

All required code will be developed in the R language. Although there are no particular coding
standards for this kind of model development project, the code will be commented throughout,
making it easier for the reader to follow. Parameters selected, or assumptions made, in the
machine learning models require justification and a reference to the relevant section of the
thesis. A number of third-party libraries are required for this study and will only be used
where their licences permit.

4.2 Legal, Ethical, and Social Issues

All terms and conditions of third-party packages will be acknowledged and respected. Both the
financial data and the social media data must be free to use for this kind of study and free
of patent issues. No productionisation plans are in place, so the project is not expected to
raise any copyright issues. Furthermore, the Ethics Form is submitted separately on the
project system, and the project is not expected to raise any ethical or social issues.

Bibliography

ARTHUR, W. B., HOLLAND, J. H., LEBARON, B., PALMER, R. & TAYLOR, P. 1996. Asset pricing
under endogenous expectation in an artificial stock market.

BHARDWAJ, A., NARAYAN, Y. & DUTTA, M. 2015. Sentiment analysis for Indian stock
market prediction using Sensex and Nifty. Procedia Computer Science, 70, 85-91.

BOLLEN, J., MAO, H. & ZENG, X. 2011. Twitter mood predicts the stock market. Journal of
Computational Science, 2, 1-8.

CHATZIS, S. P., SIAKOULIS, V., PETROPOULOS, A., STAVROULAKIS, E. & VLACHOGIANNAKIS, N.
2018. Forecasting stock market crisis events using deep and statistical machine learning
techniques. Expert Systems with Applications, 112, 353-371.

CHEN, H., XIAO, K., SUN, J. & WU, S. 2017. A double-layer neural network framework for high-
frequency forecasting. ACM Transactions on Management Information Systems, 7, 11.

DAS, S., BEHERA, R. K. & RATH, S. K. 2018. Real-time sentiment analysis of Twitter
streaming data for stock prediction. Procedia Computer Science, 132, 956-964.

FISCHER, T. & KRAUSS, C. 2018. Deep learning with long short-term memory networks for
financial market predictions. European Journal of Operational Research, 270, 654-669.

GOYKHMAN, M. & TEIMOURI, A. 2018. Machine learning in sentiment reconstruction of the
simulated stock market. Physica A: Statistical Mechanics and its Applications, 492, 1729-1740.

HENRIQUE, B. M., SOBREIRO, V. A. & KIMURA, H. 2019. Literature review: Machine learning
techniques applied to financial market prediction. Expert Systems with Applications.

HIRANSHA, M., GOPALAKRISHNAN, E. A., MENON, V. K. & SOMAN, K. P. 2018. NSE stock
market prediction using deep-learning models. Procedia Computer Science, 132, 1351-1362.

KATAYAMA, D. & TSUDA, K. 2018. A method of measurement of the impact of
Japanese news on stock market. Procedia Computer Science, 126, 1336-1343.

KIM, H. Y. & WON, C. H. 2018. Forecasting the volatility of stock price index: A hybrid model
integrating LSTM with multiple GARCH-type models. Expert Systems with Applications, 103,
25-37.

LACHIHEB, O. & GOUIDER, M. S. 2018. A hierarchical deep neural network design for stock
returns prediction. Procedia Computer Science, 126, 264-272.

LI, B., CHAN, K. C., OU, C. & RUIFENG, S. 2017. Discovering public sentiment in social media for
predicting stock movement of publicly listed companies. Information Systems, 69, 81-92.

LI, X., XIE, H., CHEN, L., WANG, J. & DENG, X. 2014. News impact on stock price return via
sentiment analysis. Knowledge-Based Systems, 69, 14-23.

LONG, W., LU, Z. & CUI, L. 2019. Deep learning-based feature engineering for stock price
movement prediction. Knowledge-Based Systems, 164, 163-173.

NAM, K. & SEONG, N. 2019. Financial news-based stock movement prediction using
causality analysis of influence in the Korean stock market. Decision Support Systems, 117,
100-112.

NAYAK, A., PAI, M. M. & PAI, R. M. 2016. Prediction models for Indian stock market.
Procedia Computer Science, 89, 441-449.

NGUYEN, T. H., SHIRAI, K. & VELCIN, J. 2015. Sentiment analysis on social media
for stock movement prediction. Expert Systems with Applications, 42, 9603-9611.

TUMASJAN, A., SPRENGER, T. O., SANDNER, P. G. & WELPE, I. M. 2010. Predicting elections with
Twitter: What 140 characters reveal about political sentiment. Fourth International AAAI
Conference on Weblogs and Social Media.

WANG, J.-J., WANG, J.-Z., ZHANG, Z.-G. & GUO, S.-P. 2012. Stock index forecasting based on a
hybrid model. Omega, 40, 758-766.

ZHANG, G., PATUWO, B. E. & HU, M. Y. 1998. Forecasting with artificial neural networks: The
state of the art. International Journal of Forecasting, 14, 35-62.

ZHANG, J., CUI, S., XU, Y., LI, Q. & LI, T. 2018. A novel data-driven stock price trend
prediction system. Expert Systems with Applications, 97, 60-69.

ZHANG, K., ZHONG, G., DONG, J., WANG, S. A. & WANG, Y. 2019. Stock market prediction
based on generative adversarial network. Procedia Computer Science, 400-406.

ZHANG, X., FUEHRES, H. & GLOOR, P. A. 2011. Predicting stock market
indicators through Twitter “I hope it is not as bad as I fear”. Procedia - Social and
Behavioral Sciences, 26, 55-62.
