Stock Prediction With R

Stock prediction using ETFs

11 April 2018

Skills

RStudio
quantmod
timeseries
xgboost
highcharter
psych
pROC

This is an example of stock prediction with R using ETFs in which the stock is a constituent. To reduce seasonality and other non-stationarity in the data, we used technical indicators such as RSI, ADX and Parabolic SAR, which are more or less stationary. The goal of the project is to predict whether today’s stock price will be higher or lower than yesterday’s. This work was done as a term project for the course IE 7275: Data Mining for Engineers at Northeastern University.

Packages Required

xgboost
quantmod
highcharter
psych
pROC

All of these are available from CRAN.

Prerequisites

Knowledge of R Programming
RStudio

Dataset Description

The data for this project comes from Yahoo Finance via quantmod’s built-in getSymbols() function, which returns each time series as an xts object. The last() function lets us restrict the time range; I’m using the last 5 years of data for this project. The following stocks/ETFs were used:

Response Variables: JPMorgan - Open, Close
Predictor Variables: FNCL - Fidelity MSCI Financials Index, IYF - iShares US Financials ETF, XLF - Financial Select Sector SPDR Fund

A keen observer would note that all three predictor variables are ETFs that track banking and finance stocks. JPM is a constituent of all three funds.
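As a rough sketch of the download step described above, the snippet below wraps getSymbols() and last() in a small helper. The helper name fetch_last_5y and the synthetic demo series are assumptions made for illustration and are not taken from the project code; the demo part avoids a network call so the snippet also runs offline.

```r
library(quantmod)  # also attaches xts, which provides last()

# Hypothetical helper: download one symbol from Yahoo Finance and keep
# the most recent 5 years. auto.assign = FALSE returns the xts object
# instead of assigning it into the global environment.
fetch_last_5y <- function(symbol) {
  prices <- getSymbols(symbol, src = "yahoo", auto.assign = FALSE)
  last(prices, "5 years")
}

# Offline demonstration of last() on a synthetic daily series.
dates  <- seq(as.Date("2010-01-01"), as.Date("2018-04-11"), by = "day")
demo   <- xts(seq_along(dates), order.by = dates)
recent <- last(demo, "5 years")

# With network access, the tickers listed above could be fetched as:
# jpm  <- fetch_last_5y("JPM")
# fncl <- fetch_last_5y("FNCL")
# iyf  <- fetch_last_5y("IYF")
# xlf  <- fetch_last_5y("XLF")
```

Each call would return an xts object with Open, High, Low, Close, Volume and Adjusted columns for the chosen ticker.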

Visualisation of Price History

The highcharter library is a brilliant tool for generating visually appealing, interactive charts. It’s free for non-commercial/academic use but requires a license for commercial use. This is the first time I’m playing with this library and I gotta say, it’s really neat.

The following chart was generated using highcharter.

[Figure: price history chart generated with highcharter]

Check out the whole chart here

Prediction Model Description

Our goal in this project is to use ETFs to predict the movement of one of their constituent stocks. The premise is that we can think of an ETF as a representative of the entire industry. Banking and financial firms are all fairly strongly correlated with each other, as even a minor policy change could potentially affect all of them. Thus, by training our machine learning models on the performance of the ETFs, we can arrive at a reasonable prediction for the target stock: JPMorgan (JPM).

Note: This is a stock prediction project done as part of a term assignment and is clearly not to be taken as sound investment advice. Predicting stock prices in the real market is far more challenging and requires enormous effort and way more degrees and qualifications than what we currently have :) Cheers!

One common mistake in using time series data is ignoring that the data tends to exhibit seasonality; to arrive at an accurate measure, we need to convert it into stationary data. Check out this article by Vegard Flovik, where he talks more about this: https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd/

One way to do this is to difference the data. Since this is financial data, though, we can instead use the many technical indicator functions available through the quantmod package (it loads TTR, which provides them) to generate indicator series that are more or less free of seasonality.
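To make the differencing idea concrete, here is a minimal base R sketch. It uses a synthetic random walk as a stand-in for a price series (an assumption for illustration, not the project data) and shows that first differences recover the stationary shocks that built the walk.

```r
# Synthetic random walk standing in for a price series (not real data).
set.seed(1)
shocks <- rnorm(500)      # independent daily shocks
walk   <- cumsum(shocks)  # non-stationary: each value carries all past shocks

# First differencing removes the accumulated level and recovers the
# shocks (apart from the first one, which has no predecessor).
differenced <- diff(walk)
isTRUE(all.equal(differenced, shocks[-1]))

# For prices, differencing the logs gives daily log returns.
prices      <- c(100, 102, 101, 105, 107)
log_returns <- diff(log(prices))
```

The differenced series is one observation shorter than the input, which is worth remembering when aligning it with other columns.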

Some of the indicators we used in our model are:

RSI - Relative Strength Index (a momentum measure, scaled to 0-100, that compares the stock’s average recent gains with its average recent losses)
[Figure: RSI chart]

ADX - Average Directional Index (a measure of trend strength)
[Figure: ADX chart]

Parabolic SAR - Parabolic Stop and Reverse, a trend-following indicator
[Figure: Parabolic SAR chart]

After munging out all the numbers for the indicators, we then feed them into our model. We also lag the indicators by 1 day to avoid lookahead bias in the data.
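The indicator and lag steps could look roughly like the sketch below. It runs on a synthetic High/Low/Close series rather than the real ETF data, and the window length n = 14 (the TTR default) and the column names are assumptions, since the text does not state the settings actually used.

```r
library(quantmod)  # attaches TTR (RSI, ADX, SAR) and xts

# Synthetic High/Low/Close series; a stand-in for real ETF data.
set.seed(42)
n     <- 250
dates <- seq(as.Date("2017-01-02"), by = "day", length.out = n)
close <- 100 + cumsum(rnorm(n))
hlc   <- xts(cbind(High  = close + runif(n, 0.1, 1),
                   Low   = close - runif(n, 0.1, 1),
                   Close = close),
             order.by = dates)

# Indicator columns; n = 14 is the TTR default, assumed here.
rsi <- RSI(hlc$Close, n = 14)
adx <- ADX(hlc, n = 14)[, "ADX"]      # ADX() also returns DIp, DIn and DX
sar <- SAR(hlc[, c("High", "Low")])

features <- merge(rsi, adx, sar)
colnames(features) <- c("RSI", "ADX", "SAR")

# Shift every indicator forward by one day, so the row for day t holds
# values computed from data up to day t - 1.
features_lagged <- lag(features, k = 1)
```

With the lag in place, the row used to predict a given day contains only information that was available at the previous close. Note that other packages (dplyr, for example) define their own lag(), so the xts behaviour shown here assumes a session where it is not masked.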

Machine Learning Algorithm

We use the xgboost algorithm with a binary logistic objective. After splitting the data into training (approx. 70%) and test (approx. 30%) sets, we feed the training set to the algorithm.
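A minimal sketch of this training step is shown below. The data is synthetic, the split is assumed to be chronological (the text does not say how rows were assigned), and apart from the binary logistic objective and the 10 rounds mentioned in this write-up, all settings are left at xgboost defaults.

```r
library(xgboost)

# Synthetic stand-in data: three lagged indicator columns and a 0/1 label
# (1 = price closed higher than the previous day). Not the project's data.
set.seed(7)
n <- 1000
x <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("RSI", "ADX", "SAR")))
y <- as.numeric(x[, "RSI"] + rnorm(n) > 0)

# Chronological split: first ~70% of rows for training, the rest for
# testing, so no future rows leak into the training set.
n_train   <- floor(0.7 * n)
train_idx <- seq_len(n_train)
test_idx  <- (n_train + 1):n

dtrain <- xgb.DMatrix(data = x[train_idx, ], label = y[train_idx])
dtest  <- xgb.DMatrix(data = x[test_idx, ],  label = y[test_idx])

# 10 boosting rounds with a binary logistic objective.
model <- xgb.train(params  = list(objective = "binary:logistic"),
                   data    = dtrain,
                   nrounds = 10)

# Predicted probability that the label is 1; threshold at 0.5 for a class.
prob_up  <- predict(model, dtest)
pred_up  <- as.numeric(prob_up > 0.5)
accuracy <- mean(pred_up == y[test_idx])
```

The predicted probabilities in prob_up, rather than the thresholded classes, are what an ROC curve would be built from.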

Here’s the ROC Curve for our first run on 10 rounds.

[Figure: ROC curve for the xgboost model]

We achieved an AUC of 0.5919, which is only modestly better than the 0.5 expected from random guessing.
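For readers unfamiliar with the metric, the sketch below shows how an AUC can be computed with pROC and checked by hand. The four labels and scores are toy values chosen for illustration; they are not the project’s predictions.

```r
library(pROC)

# Toy labels and scores for illustration only (not the project's results).
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

# levels = c(control, case); direction = "<" says cases should score higher.
roc_obj   <- roc(response = labels, predictor = scores,
                 levels = c(0, 1), direction = "<")
auc_value <- as.numeric(auc(roc_obj))

# Hand check: AUC is the share of (positive, negative) pairs in which the
# positive example gets the higher score. Here 3 of the 4 pairs qualify.
pos <- scores[labels == 1]
neg <- scores[labels == 0]
pairwise_auc <- mean(outer(pos, neg, ">"))
```

Both calculations give 0.75 for this toy input. Calling plot(roc_obj) would draw the corresponding ROC curve.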

To cross-check this result and further test our approach, we ran KNN classification on the same data set. Using a handy script I wrote, we arrived at an optimum K value of 8.
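The script itself is not reproduced here, so the following is only a hypothetical sketch of how such a K search might work: it tries a range of K values with class::knn() on synthetic data and keeps the one with the best hold-out accuracy. The grid 1 to 15 and the use of accuracy as the selection criterion are assumptions.

```r
library(class)  # knn(); ships with standard R installations

# Synthetic stand-in data with a signal in the first column.
set.seed(123)
n <- 600
x <- matrix(rnorm(n * 3), ncol = 3)
y <- factor(as.numeric(x[, 1] + rnorm(n) > 0))

# Chronological ~70/30 split, as described for the xgboost model.
n_train <- floor(0.7 * n)
train_x <- x[seq_len(n_train), ]
test_x  <- x[(n_train + 1):n, ]
train_y <- y[seq_len(n_train)]
test_y  <- y[(n_train + 1):n]

# Try K = 1..15 and record hold-out accuracy for each value.
k_grid   <- 1:15
accuracy <- sapply(k_grid, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

best_k <- k_grid[which.max(accuracy)]
```

Picking K on the same hold-out set that is later used for reporting gives an optimistic estimate; a separate validation fold would be the more careful choice.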

This is the ROC Curve for the k=8 KNN Classification

[Figure: ROC curve for the k=8 KNN classifier]

We achieved an AUC of 0.5728.

XGBoost Visualisation

The DiagrammeR R package allows us to visualise the tree structure generated by xgboost. Here’s the entire structure.

[Figure: full xgboost tree structure]

IMO, it looks really cool.

This is what we get when we zoom into one tree:

[Figure: a single xgboost tree]

Codebase and License

Here’s the full GitHub repo for this project. This project is licensed under the MIT License - see the LICENSE.md file located in my GitHub repo for more details.

Acknowledgments

Project Collaborator : Suman Kumar
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
R Packages used : xgboost, quantmod, highcharter, psych, pROC