
http://www.michaeljgrogan.com/

Regression Analysis and Data Structuring Methods: R and Python

About The Author


My name is Michael Grogan. I am a data scientist with a profound passion
for statistics and programming. Please visit my website where you can find
my latest thoughts on data science, as well as detailed tutorials in Python,
R, SQL, and other programming languages commonly used in data science.
You can get in contact with me directly via e-mail, and please don’t forget to
keep up to date with my latest data science content!

Contents

Introduction...................................................................................................................1

Basic Principles of Regression Analysis........................................................................2

Ordinary Least Squares – An Analysis of Stock Returns..............................................5

Variance Inflation Factor Test For Multicollinearity.....................................................8

Running an OLS Regression in Python – use of the pandas and statsmodels libraries...........................9

Serial Correlation and the Cochrane-Orcutt Remedy................................................11

Stationarity and Cointegration Testing......................................................................18

ARIMA Model – Stock Price Forecasting Using R........................................................24

Implementation of ARIMA Model in Python...............................................................29

Data Cleaning and Structuring Tips in R.....................................................................33


Introduction
Hello and thank you for ordering the “Regression Analysis and Data Structuring Methods: R
and Python” e-book! My name is Michael Grogan, and I am a data scientist and statistician.

R and Python are the two cornerstone languages within the field of data science. I have utilised
both in the analysis of data for a wide variety of client-based projects, as well as for my own
purposes (particularly when it comes to analysing market-related data).

The purpose of this e-book is to provide an introduction to how these languages can be
utilised to conduct cross-sectional and time-series regression analysis, as well as how to
structure data more effectively to facilitate analysis.


Basic Principles of Regression Analysis


When we use regression analysis, we are doing so to predict the impact of a change in one
variable on another. For instance, suppose that we have a dependent variable (Y) denoting
consumption, and an independent variable (X) denoting income. Assume we have the following
regression equation:

Consumption = $5000 + $1.50(Income)

This regression equation is divided into two parts:

• Intercept: In this equation, $5000 denotes our intercept. While the intercept can be
spurious in many cases, $5000 is our minimum consumption level; i.e. assuming no
increase in a person's income, this is the minimum amount a person will spend to
“consume” in any given period.
• Beta Coefficient: Our coefficient of $1.50 refers to the change in consumption given a
unit change in income. In this case, if income increases/decreases by $1 then
consumption also increases/decreases by $1.50. Now, let us say that income increases by
$100. What's going to happen? You guessed it – consumption will increase by $150
(100*1.5 = 150). In this case, total consumption would add up to $5000 + $150 = $5150, as
illustrated in the short sketch below.
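
As a quick illustration, here is a minimal R sketch of plugging the numbers from this example into the equation (the figures are taken directly from the text above):

# consumption equation from the example above
income <- 100                          # income increases by $100
consumption <- 5000 + 1.50 * income    # intercept plus 1.50 * income
consumption                            # returns 5150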

Generally speaking, datasets can be one of three types:

• Cross-Sectional: A cross-sectional dataset is one where all data is collected at a specific
point in time. For instance, surveying the heights of 100 random people at the same point
in time is cross-sectional, since the data does not vary through time.

• Time-series: A time-series dataset is one where data is collected over time and changes
with time. For instance, when we analyse the returns of a currency, stock, market index,
etc, this is a time series dataset since the data is streamed over different periods.

• Panel data: Panel data consists of data that is both time series and cross sectional.

When conducting regression analysis, it is not only the results of our analysis that we are
interested in – but also the statistical significance of those results. When we compute a
regression equation (as above), we are actually estimating a trend line based on a range of
observations.


For instance, the trend line below is estimated from the various observations in our dataset (a
scatter plot). The distance between the trend line and an observation is known as the residual.
Our primary goal is to minimise the distance between the observed values and the trend line.
Should these distances become too large, we risk statistical insignificance in our model: the true
values deviate so greatly from the expected values that any predictive analysis from the model
becomes invalid.

Conventionally, we assess statistical significance through the t-statistic. Given that we are using a
5% level of significance (or a 95% confidence interval), the p (probability) value is what allows us
to determine significance. For instance, if our p-value for a particular coefficient lies below 0.05,
then we conclude that the coefficient is significant at the 5% level. The most common form of
significance test is the two-tailed test. For instance:

Null Hypothesis (the result we do not expect): Income = 0

Alternative Hypothesis (the result we do expect): Income ≠ 0

If our p-value comes back as significant (under 0.05 at the 5% level), then we reject the null
hypothesis and conclude that the income variable is significantly different from zero.


Normal Distribution Curve

As previously explained, statistical computing packages will typically calculate a p-value from
which we can determine whether our coefficient is significant at a certain threshold. However,
should we wish to calculate the t-statistic manually, we compare it to the critical value to
determine whether or not to reject the null hypothesis. For example:

t-value = Estimate/Standard Error

• Assuming an estimate of 681.402 along with a standard error of 103.270, this yields a t-value
of 6.598 (681.402/103.270 = 6.598).
• At the 5% significance level (95% confidence), the critical value for a two-tailed test under a
normal distribution is 1.96.
• Given that 6.598 > 1.96, we reject the null hypothesis at the 5% significance level and
conclude that the coefficient is significantly different from zero (a short sketch of this
calculation follows below).
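
As a minimal R sketch using the estimate and standard error quoted above, the manual calculation might look as follows:

# manual t-test for the intercept estimate quoted above
estimate <- 681.402
std_error <- 103.270
t_value <- estimate / std_error       # roughly 6.598
critical <- qnorm(0.975)              # roughly 1.96, the two-tailed 5% critical value
t_value > critical                    # TRUE, so we reject the null hypothesis
2 * pnorm(-abs(t_value))              # corresponding two-tailed p-value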


Ordinary Least Squares – An Analysis of Stock Returns


In the website tutorial given on Ordinary Least Squares (which I recommend reviewing before
reading further), we ran an analysis on a hypothetical dataset of 49 stocks. The purpose of the
regression was to determine the impact of dividends, earnings and debt to equity on stock
returns. We are using a cross-sectional dataset in this instance – meaning that all data is
collected at a specific point in time.

Given that we have laid out the essentials behind running and interpreting a regression on this
dataset, the primary aim here is to illustrate the commands run in the R Programming
Language to read data from the appropriate csv file, and perform the various calculations.

1. To download the RStudio IDE, please refer to http://www.rstudio.com/.

2. Save the ols_stock csv file to the relevant directory; e.g. C:\\Users\\Your Computer
Name\\Documents\\R\\ols_stock.csv.

3. Open the RStudio interface and click File -> New File -> R Script.

4. Set the relevant directory by entering the command:

setwd("C:\\Users\\Your Computer Name\\Documents\\R")

5. Read the relevant data:

mydata <- read.csv("C:\\Users\\Your Computer Name\\Documents\\R\\ols_stock.csv")


attach(mydata)

6. Run the OLS regression:

reg <- lm(stock_return_scaled ~ dividend + earnings_ranking + debt_to_equity)


summary(reg)

7. Plot the data:

plot (stock_return_scaled ~ dividend + earnings_ranking + debt_to_equity)


reg1 <- lm(stock_return_scaled ~ dividend + earnings_ranking + debt_to_equity)

8. Redefine the variables:

Y <- cbind(stock_return_scaled)
X <- cbind(dividend, earnings_ranking, debt_to_equity)


9. Load the “car” library (install it first via install.packages("car") if necessary):

library(car)

10. Conduct tests for multicollinearity by regressing the independent variables against
each other, along with conducting a variance inflation factor (VIF) test:

reg1m <- lm(dividend ~ earnings_ranking + debt_to_equity)


summary(reg1m)
reg2m <- lm(earnings_ranking ~ dividend + debt_to_equity)
summary(reg2m)
reg3m <- lm(debt_to_equity ~ earnings_ranking + dividend)
summary(reg3m)

vif(reg1m)
vif(reg2m)
vif(reg3m)

11. Conduct a Breusch-Pagan test for heteroscedasticity:

library(lmtest)
bptest(Y ~ X)

Regression Results

When we run the model initially, we see that the “dividend” variable comes back as
insignificant:

Call:
lm(formula = stock_return_scaled ~ earnings_ranking + dividend +
debt_to_equity)
Residuals:
Min 1Q Median 3Q Max
-142.56 -53.63 -15.25 22.21 603.66
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 681.402 103.270 6.598 4.03e-08 ***
earnings_ranking -10.147 2.465 -4.116 0.000162 ***
dividend -102.704 76.943 -1.335 0.188653
debt_to_equity -182.134 52.068 -3.498 0.001068 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 120.1 on 45 degrees of freedom
Multiple R-squared: 0.3573, Adjusted R-squared: 0.3144
F-statistic: 8.338 on 3 and 45 DF, p-value: 0.0001616


If we drop this variable and run the regression again, we obtain:

Call:
lm(formula = stock_return_scaled ~ earnings_ranking + debt_to_equity)
Residuals:
Min 1Q Median 3Q Max
-138.86 -70.46 -15.50 32.15 619.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 605.958 87.161 6.952 1.07e-08 ***
earnings_ranking -7.907 1.821 -4.342 7.68e-05 ***
debt_to_equity -213.532 46.845 -4.558 3.81e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 121.1 on 46 degrees of freedom
Multiple R-squared: 0.3318, Adjusted R-squared: 0.3028
F-statistic: 11.42 on 2 and 46 DF, p-value: 9.384e-05

From the above, our regression equation is as follows:


stock_return_scaled = 605.958 – 7.907 (earnings_ranking) – 213.532 (debt_to_equity)

With our stock return (dependent variable) being measured in basis points, our regression
equation can be interpreted as follows:

• A one-unit increase in earnings ranking (i.e. a worse earnings position in ranking terms – a
ranking of 10 indicates lower earnings than a ranking of 1) corresponds to a drop of 7.907
basis points in a stock's return.
• An increase of 1 in the debt/equity ratio corresponds to a drop of 213.532 basis points in a
stock's return. On an incremental basis (213.532/100), an increase of 0.01 in the
debt/equity ratio corresponds to a drop of roughly 2.14 basis points in overall stock returns.
(A quick sketch of plugging values into this equation follows below.)
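
As a purely hypothetical illustration (the input values below are not taken from the dataset), the fitted equation can be used to generate a predicted scaled return:

# hypothetical stock: earnings ranking of 10 and debt/equity ratio of 0.5
predicted_return <- 605.958 - 7.907 * 10 - 213.532 * 0.5
predicted_return    # roughly 420.1 basis points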

Note that our Breusch-Pagan test for heteroscedasticity is insignificant at the 5% level. You will
also notice that our dependent variable is a scaled stock return. This means that we are
calculating the returns assuming a constant market capitalisation of $100 million. This is
important, as the returns of various companies will vary depending on size.

For instance, on a particular stock market index, the large-cap stocks will typically skew the
overall return to a greater extent than the small-caps. In other words, we would have an uneven
variance across our samples, which would result in unreliable hypothesis testing. By scaling our
returns in this manner, the Breusch-Pagan test for heteroscedasticity came back as statistically
insignificant at the 5% level, and therefore we cannot reject our null hypothesis of
homoscedasticity.


Variance Inflation Factor Test For Multicollinearity


Additionally, it is necessary to test for multicollinearity using the Variance Inflation Factor
(VIF) test. Multicollinearity arises when two independent variables are significantly correlated
with each other, which inflates the variance of our OLS estimates and leads to unreliable
hypothesis testing.

We test for this using the Variance Inflation Factor test, where a reading of 5 or higher is
considered to be a warning sign for multicollinearity.

From the output below, we see that the VIF readings are below 5 across our independent variables.
Therefore, our VIF test suggests that our independent variables are free of severe multicollinearity.

> vif(reg1m)
earnings_ranking debt_to_equity
2.215381 2.215381
> vif(reg2m)
dividend debt_to_equity
2.696695 2.696695
> vif(reg3m)
earnings_ranking dividend
4 4


Running an OLS Regression in Python – use of the pandas and statsmodels libraries
In this instance, we use the dataset that we initially regressed in R for the Ordinary Least Squares
tutorial pertaining to stock market data. To form a linear regression model, we set up our model
in statsmodels as follows:

import numpy as np
import pandas
import statsmodels.api as sm
import statsmodels.formula.api as smf

variables = pandas.read_csv('file.csv')
y = variables['stock_return_scaled']
x1 = variables['dividend']
x2 = variables['earnings_ranking']
x3 = variables['debt_to_equity']

x = np.column_stack((x1,x2,x3))
x = sm.add_constant(x, prepend=True)

results = sm.OLS(y, x).fit()
print(results.summary())

In addition to defining our independent variables (x1, x2, x3), we see that we are stacking our
independent variables into a single x variable using the x = np.column_stack((x1,x2,x3))
command.

This allows our model to take multiple parameters into account and generate detailed output by
first fitting the model through results = sm.OLS(y, x).fit() and then generating a summary of
the results through the print(results.summary()) command.

Once we run this code, we generate the following output in Python:

OLS Regression Results


==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.319
Method: Least Squares F-statistic: 8.342
Date: Sun, 10 Jul 2016 Prob (F-statistic): 0.000168
Time: 20:55:58 Log-Likelihood: -295.45
No. Observations: 48 AIC: 598.9
Df Residuals: 44 BIC: 606.4
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 680.7824 102.383 6.649 0.000 474.443 887.122
x1 -116.6710 76.995 -1.515 0.137 -271.844 38.502
x2 -10.5126 2.459 -4.275 0.000 -15.469 -5.556
x3 -168.7207 52.589 -3.208 0.002 -274.706 -62.735
==============================================================================
Omnibus: 62.282 Durbin-Watson: 1.256
Prob(Omnibus): 0.000 Jarque-Bera (JB): 530.769
Skew: 3.245 Prob(JB): 5.56e-116
Kurtosis: 17.942 Cond. No. 189.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.

As we can see, the statsmodels library allows us to generate highly detailed output on a level
similar to R, with additional statistics such as skew, kurtosis, R-Squared and AIC.

Note that, while there are slight differences, the model output corresponds closely to that which we
originally obtained when we ran the regression in R. In this regard, dropping the insignificant x1
(“dividend”) variable would likewise yield results in line with those obtained in R.


Serial Correlation and the Cochrane-Orcutt Remedy


We have already looked at issues that can arise when using OLS to analyse cross-sectional data.
However, what about time series data?

Serial correlation (also known as autocorrelation) is a violation of the Ordinary Least Squares
assumption that all observations of the error term in a dataset are uncorrelated. In a model with
serial correlation, the current value of the error term is a function of the one immediately
previous to it:

et = ρe(t-1) + ut

where e = error term of the equation in question; ρ = first-order autocorrelation coefficient; u = classical
(not serially correlated) error term

This issue is quite endemic in time-series models, given that time series data is hardly ever
random and often shows particular patterns and relationships between past and future data.
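
As a purely illustrative sketch (not part of the original analysis), an AR(1) error process can be simulated in R to show what such correlations look like:

# simulate a serially correlated (AR(1)) error series: e_t = 0.9*e_(t-1) + u_t
set.seed(123)
u <- rnorm(500)                                                # classical, uncorrelated errors
e <- as.numeric(filter(u, filter = 0.9, method = "recursive")) # recursive filter builds the AR(1) series
acf(e)                                                         # autocorrelations decay slowly, indicating serial correlation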

In this particular example, the relationship between oil prices and fluctuations in the S&P 500
stock market index is analysed for the period June 2015 – October 2016. An Ordinary Least
Squares regression is run to model the relationship between the two, and a Durbin-Watson test
and Cochrane-Orcutt procedure are applied to test for and remedy this condition respectively.

The OLS model used to describe this relationship is:

S&P 500 Price = Intercept + β(Oil Price) + e

Call:
lm(formula = gspc ~ oil)

Residuals:
Min 1Q Median 3Q Max
-195.309 -46.802 6.726 45.612 139.918

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1768.1278 20.2744 87.21 <2e-16 ***
oil 6.5421 0.4488 14.58 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 70.44 on 499 degrees of freedom
Multiple R-squared: 0.2986, Adjusted R-squared: 0.2972
F-statistic: 212.5 on 1 and 499 DF, p-value: < 2.2e-16


Next, we test the model for the presence of serial correlation using the Durbin-Watson test. With
a p-value below 0.05 as shown, this is an indication that serial correlation is present in the model
and needs to be remedied.

> dwtest(reg1)

Durbin-Watson test
data: reg1
DW = 0.047108, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater
than 0

Notably, the Cochrane-Orcutt remedy only works when the errors follow a stationary AR(1)
process. In this case, taking a first difference of the data results in a stationary process whereby
the data has a constant mean, variance and autocorrelation.

Residuals of OLS Regression (Serial Correlation Present)


Residuals of First Differenced OLS Regression (Serial Correlation Eliminated)

Consequences of Serial Correlation

According to Studenmund (2010) – a textbook which I find gives a solid introduction to the
particulars of serial correlation – the consequences of this condition for a regression model are as
follows:

► Ordinary Least Squares is no longer the minimum variance estimator among all linear
unbiased estimators.

► Standard errors encounter significant bias in the face of serial correlation, increasing the risk
of making a Type 1 or Type 2 error.

► Our coefficient estimates remain unbiased in the face of serial correlation.

First Differencing

The purpose of first differencing – as mentioned – is to transform a non-stationary time series
into a stationary one. This is necessary in order to ensure that the data in question follows a
stationary AR(1) process.


Therefore, first differences of the variables are obtained, and a summary of the regression on the
first-differenced variables is then calculated.
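
A minimal sketch of how this might be done, assuming gspc and oil are numeric vectors of prices:

# first-difference both series and re-run the regression
diff_gspc <- diff(gspc)
diff_oil <- diff(oil)
reg2 <- lm(diff_gspc ~ diff_oil)
summary(reg2)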

Call:
lm(formula = diff_gspc ~ diff_oil)
Residuals:
Min 1Q Median 3Q Max
-65.068 -3.908 -0.183 5.016 73.286
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1826 0.6836 0.267 0.79
diff_oil 5.6864 0.6935 8.199 2.07e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.28 on 498 degrees of freedom
Multiple R-squared: 0.1189, Adjusted R-squared: 0.1172
F-statistic: 67.23 on 1 and 498 DF, p-value: 2.067e-15

To ensure that the data follows an AR(1) process, a formal Dickey-Fuller test can be run and the
autocorrelation functions plotted.

When ADF tests are run on both the ordinary and first differenced variables, we see that the
former have p-values above 0.05 (indicating non-stationarity), while the latter have p-values
below this threshold (indicating stationarity).

Dickey-Fuller Test on model variables

> adf.test(gspc)

Augmented Dickey-Fuller Test


data: gspc
Dickey-Fuller = -2.38, Lag order = 7, p-value = 0.4174
alternative hypothesis: stationary

> adf.test(oil)
Augmented Dickey-Fuller Test

data: oil
Dickey-Fuller = -2.1883, Lag order = 7, p-value = 0.4986
alternative hypothesis: stationary


Dickey-Fuller test on model variables (first-differenced)

> adf.test(diff_gspc)
Augmented Dickey-Fuller Test

data: diff_gspc
Dickey-Fuller = -8.2973, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In adf.test(diff_gspc) : p-value smaller than printed p-
value
> adf.test(diff_oil)

Augmented Dickey-Fuller Test


data: diff_oil
Dickey-Fuller = -8.1493, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(diff_oil) : p-value smaller than printed p-
value

A plot of the autocorrelation functions shows a sudden drop in correlations after lag 1 for the
first-differenced series, also indicating stationarity.


Cochrane-Orcutt Remedy

Given that the presence of a stationary AR(1) series has been established, the Cochrane-Orcutt
method is appropriate to use in this case to remedy serial correlation.

The method works by estimating a ρ value, that is, the correlation between the residuals and their
lagged values, and then transforming the variables as follows:

yt* = yt − ρ̂y(t-1)
xt* = xt − ρ̂x(t-1)

When the autocorrelation function was run on the residuals of the initial regression, the
following autocorrelation values were returned:

Lag 0 1 2 3 4 5 6 7 8 9
Autocorrelation 1 0.98 0.95 0.94 0.91 0.9 0.88 0.86 0.85 0.83


The Cochrane-Orcutt estimator in R estimates the appropriate value of ρ̂ to use in estimating the
new regression. The purpose of ρ̂ is to formulate a regression where the correlations between
one error term and the previous one are removed, so that each observation becomes IID
(independent and identically distributed).
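
A minimal sketch of how the procedure might be invoked, assuming the orcutt package and the original model reg1 from above:

# Cochrane-Orcutt estimation on the original OLS regression
library(orcutt)
orcuttreg1 <- cochrane.orcutt(reg1)
orcuttreg1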

When the Cochrane-Orcutt procedure is run, the updated regression is displayed, and it is
observed that the ρ̂ value of 0.977051 generated is very close to the correlation coefficient of
0.975 calculated for lag 1 initially. Therefore, this value is used to calculate the new regression
output as below:

> orcuttreg1
Cochrane-orcutt estimation for first order autocorrelation

Call:
lm(formula = gspc ~ oil)
number of interaction: 4
rho 0.977051

Durbin-Watson statistic
(original): 0.04711 , p-value: 4.902e-107
(transformed): 2.08847 , p-value: 8.393e-01

coefficients:
(Intercept) oil
1811.922439 5.737714

Moreover, the transformed Durbin-Watson statistic now has a p-value above 0.05, indicating that
the serial correlation in the model has been eliminated.


Stationarity and Cointegration Testing

When we refer to a time series as stationary, we mean to say that its mean, variance and
autocorrelation are all consistent over time. Cointegration, on the other hand, is when we have
two time series that are non-stationary, but a linear combination of them results in a stationary
time series. So, why is the concept of stationarity important? Well, a large purpose of time series
modelling is to be able to predict future values from current data.

This task becomes much more difficult when mean, variance and autocorrelation parameters do
not follow a consistent pattern over time, resulting in an unreliable time series model.
Additionally, this problem is compounded by the fact that time series datasets, by their very
nature, often exhibit non-stationarity, as factors such as seasonal trends skew the mean and
variance. In order to test for stationarity, we use the Dickey-Fuller test. This test works by testing
for a unit root in the data, where:

Δy(t) = δy(t-1) + u(t)

where Δy(t) is the first difference of y, and δ = 0 indicates the presence of a unit root

Our null and alternative hypotheses are as follows:

H0 (Null Hypothesis)

δ = 0 (the data is non-stationary and must be differenced to make it stationary)

HA (Alternative Hypothesis)

δ < 0 (the data is stationary)

When we previously looked at the example of autocorrelation testing between oil prices and the
S&P 500 stock market index, we saw that our data followed an AR(1) stationary process, i.e. the
raw data did not have a constant mean, variance and autocorrelation, but the differenced series
did.


Stationarity Testing

In seeking to test for stationarity, we primarily wish to determine whether our series is trend
stationary or whether a unit root is present. A trend-stationary process is one that is fully
stationary once the trend component of the time series is removed. This is not the case in a time
series that has a unit root present.

There are three particular tests of interest that we can use to determine stationarity: 1) the KPSS
test, 2) the Dickey-Fuller test, and 3) the Phillips-Perron test. The null hypothesis of the KPSS test
is trend stationarity, while the alternative is the presence of a unit root. For the latter two tests,
the null hypothesis is the presence of a unit root (non-stationarity), whereas the alternative is
stationarity.
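
As a minimal sketch (assuming the tseries package and that gspc, oil and their first differences have already been defined), the three tests can be run as follows:

library(tseries)

# levels
kpss.test(gspc, null = "Trend")
adf.test(gspc)
pp.test(gspc)

# first differences (repeat similarly for oil and diff_oil)
kpss.test(diff_gspc, null = "Trend")
adf.test(diff_gspc)
pp.test(diff_gspc)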

When we run the tests for both the S&P 500 (gspc) and oil time series, we see that when the
series are run without differencing, a unit root is indicated to be present. However, first-
differencing the series indicates stationarity across these tests:


KPSS

> k1<-kpss.test(gspc,null="Trend")
Warning message:
In kpss.test(gspc, null = "Trend") : p-value smaller than
printed p-value
> k1
KPSS Test for Trend Stationarity
data: gspc
KPSS Trend = 0.95907, Truncation lag parameter = 5, p-
value = 0.01
> k1$statistic
KPSS Trend
0.9590747
> k1diff<-kpss.test(diff_gspc,null="Trend")
Warning message:
In kpss.test(diff_gspc, null = "Trend") :
p-value greater than printed p-value

> k1diff
KPSS Test for Trend Stationarity
data: diff_gspc
KPSS Trend = 0.04457, Truncation lag parameter = 5, p-
value = 0.1
> k1diff$statistic
KPSS Trend
0.04457013

> k2<-kpss.test(oil,null="Trend")
Warning message:
In kpss.test(oil, null = "Trend") : p-value smaller than
printed p-value

> k2
KPSS Test for Trend Stationarity
data: oil
KPSS Trend = 1.5315, Truncation lag parameter = 5, p-value
= 0.01

> k2$statistic
KPSS Trend
1.531487

> k2diff<-kpss.test(diff_oil,null="Trend")
Warning message:
In kpss.test(diff_oil, null = "Trend") :
p-value greater than printed p-value


> k2diff
KPSS Test for Trend Stationarity
data: diff_oil
KPSS Trend = 0.065387, Truncation lag parameter = 5, p-
value = 0.1
> k2diff$statistic
KPSS Trend
0.06538696

Dickey-Fuller

> adf.test(gspc)
Augmented Dickey-Fuller Test

data: gspc
Dickey-Fuller = -2.38, Lag order = 7, p-value = 0.4174
alternative hypothesis: stationary

> adf.test(oil)

Augmented Dickey-Fuller Test


data: oil
Dickey-Fuller = -2.1883, Lag order = 7, p-value = 0.4986
alternative hypothesis: stationary

> adf.test(diff_gspc)

Augmented Dickey-Fuller Test


data: diff_gspc
Dickey-Fuller = -8.2973, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In adf.test(diff_gspc) : p-value smaller than printed p-
value
> adf.test(diff_oil)
Augmented Dickey-Fuller Test

data: diff_oil
Dickey-Fuller = -8.1493, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(diff_oil) : p-value smaller than printed p-
value


Phillips-Perron

> pp.test(gspc)
Phillips-Perron Unit Root Test
data: gspc
Dickey-Fuller Z(alpha) = -12.86, Truncation lag parameter
= 5, p-value = 0.3923
alternative hypothesis: stationary
> pp.test(oil)

Phillips-Perron Unit Root Test


data: oil
Dickey-Fuller Z(alpha) = -6.6153, Truncation lag parameter
= 5, p-value = 0.7407
alternative hypothesis: stationary

> pp.test(diff_gspc)

Phillips-Perron Unit Root Test


data: diff_gspc
Dickey-Fuller Z(alpha) = -488.86, Truncation lag parameter
= 5, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In pp.test(diff_gspc) : p-value smaller than printed p-
value

> pp.test(diff_oil)

Phillips-Perron Unit Root Test


data: diff_oil
Dickey-Fuller Z(alpha) = -522.48, Truncation lag parameter
= 5, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In pp.test(diff_oil) : p-value smaller than printed p-
value

Two-Step Engle Granger Method

On the issue of cointegration, we have already established that this is present when a linear
combination of the non-stationary series yields a stationary series. This means that the time
series show a relationship that is statistically significant and not simply due to chance.


However, note that simply applying a test such as adf.test to the residuals of a linear regression
is not appropriate in this case. The reason is that the appropriate critical values differ: the critical
values for residual-based tests are not the same as those for a standard ADF test.

In this regard, we use the egcm function from the egcm package in R. This function runs the
two-step Engle-Granger method on our data, where the procedure selects the appropriate values
for α, β, and ρ that best fit the following model:

Y[i] = α + β ∗ X[i] + R[i]
R[i] = ρ ∗ R[i − 1] + ε[i]
ε[i] ∼ N(0, σ²)

In other words, given that oil prices and the S&P 500 variable are non-stationary, then
cointegration between the two would indicate that a linear combination of these two variables
must be stationary. We form a linear combination between these two variables and test for
cointegration as follows:

library(egcm)
egcm(x, y) #where x = oil and y = gspc

Upon running egcm, our output is as follows:

Y[i] = 6.5421 X[i] + 1768.1278 + R[i], R[i] = 0.9897 R[i-1] + eps[i], eps ~ N(0, 15.2265^2)
(0.4488) (21.6567) (0.0097)
R[501] = 55.7788 (t = 0.793)

WARNING: X and Y do not appear to be cointegrated.

We see that, with a t-statistic of 0.793, oil prices and the S&P 500 are not indicated to be
cointegrated. Further information regarding the egcm package can be found in its documentation.


ARIMA Model – Stock Price Forecasting Using R


ARIMA (Autoregressive Integrated Moving Average) is a major tool used in time series analysis to
attempt to forecast future values of a variable based on its past values. For this particular
example, we will use a stock price time series of Johnson & Johnson (JNJ) from 2006–2016, and
use the aforementioned model to conduct price forecasting on this time series.

The purpose of ARIMA is to determine the nature of the relationship between our residuals,
which would provide our model with a certain degree of forecasting power. In the first instance,
in order to conduct a time series analysis we must express our dataset in terms of logarithms. If
our data is expressed solely in price terms, then this does not allow for continuous compounding
of returns over time and will give misleading results.

Additionally, we use the acf and pacf (Autocorrelation and Partial Autocorrelation) functions in R
to determine the nature of potential seasonal lags in our model. The autocorrelation and partial
autocorrelation function both measure, to varying degrees, the correlation coefficient between a
series and lags of the variable over time. An autoregressive process is when a time series follows
a particular pattern in that its present value is in some way correlated to its past value(s). For
instance, if we are able to use regression analysis to discern the present value of a variable from
using its past value, then we refer to this as an AR(1) process:

Xt = ß0 + ß1X(t-1) + et

However, there are some instances in which the present value of a variable can be determined
from the past two or three values, which would incorporate an AR(2) or AR(3) process
respectively:

Xt = ß0 + ß1X(t-1) + ß2X(t-2) + et
Xt = ß0 + ß1X(t-1) + ß2X(t-2) + ß3X(t-3) + et
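
As a minimal sketch of how the log series and its ACF/PACF might be generated in R (assuming the JNJ prices have been read into a vector named price, and that the tseries package is loaded for the later Dickey-Fuller test):

# log-transform the price series and inspect its autocorrelations
lnstock <- log(price)
plot(lnstock, type = "l")
acf(lnstock)
pacf(lnstock)
adf.test(lnstock)      # Dickey-Fuller test discussed in the next subsection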


An AR(1) process is characterised by a slow decay in the ACF and a sudden cutoff in the PACF after the first lag:

Series Without Differencing

First Differenced Series


Dickey-Fuller Test

In order to use an ARIMA model, we first wish to test if our time series is stationary; i.e. do we
have a constant mean, variance and autocorrelation across our time series dataset. For this
purpose, we use the Dickey-Fuller Test. At the 5% level of significance:

H0: Non-stationary series


HA: Stationary series

data: lnstock
Dickey-Fuller = -2.0974, Lag order = 4, p-value = 0.5361
alternative hypothesis: stationary

With a p-value of 0.5361, we cannot reject the null hypothesis of non-stationarity in our series.
Moreover, running the auto.arima function in R returns a random walk with drift, i.e.
ARIMA(0, 1, 0). On this basis, we choose to specify a (0, 1, 0) ARIMA model for the stock:

ARIMA Output

To generate ARIMA output by letting R itself decide the appropriate parameters, we use the
auto.arima function on the log series as follows:

fitlnstock<-auto.arima(lnstock)


summary(arima(lnstock, order = c(0,1,0)))

Call:
arima(x = lnstock, order = c(0, 1, 0))
sigma^2 estimated as 0.001697: log likelihood = 212.45,
aic = -422.89

Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.007250104 0.04103062 0.03222709 0.1625762 0.7816486 0.9927295 0.01810843

We see that ARIMA ultimately diagnoses a random walk with drift for the stock, meaning that the
movements of the stock price are largely random but drift in a particular direction over time.
Indeed, the movements of many financial assets have been found to be largely random, yet
typically follow a random walk with drift, meaning that short-term patterns play out which can
potentially be exploited. Note that, in an ideal situation, ARIMAX would be employed, which
forecasts an ARIMA model while also taking explanatory variables into account. However, in
situations where we wish to forecast a time series based on its past values alone, ARIMA is a
standard model for doing so.

If we so wish, we can also take the exponent of the log forecasts in order to obtain forecasts in
real price terms, i.e.:

forecastedvalues_ln=forecast(fitlnstock,h=10)
forecastedvaluesextracted=as.numeric(forecastedvalues_ln$mean)
finalforecastvalues=exp(forecastedvaluesextracted)
finalforecastvalues

Projecting 10 periods out would give us the following forecast in price terms:

> finalforecastvalues
[1] 119.6778 120.5520 121.4326 122.3196 123.2131 124.1131
125.0197 125.9329 126.8528 127.7794

Ljung-Box Test

While we could potentially use this model to forecast future values for price, an important test
used to validate the findings of the ARIMA model is the Ljung-Box test. Essentially, the test is
being used to determine if the residuals of our time series follow a random pattern, or if there is a
significant degree of non-randomness.


H0: Residuals follow a random pattern


HA: Residuals do not follow a random pattern

To run this test in R, we use the following function:

Box.test(fitlnstock$resid, lag=10, type="Ljung-Box")
data: fitlnstock$resid
X-squared = 14.62, df = 10, p-value = 0.1465

From the above, we see that we have an insignificant p-value (0.1465). This means that our
residuals likely exhibit a high degree of randomness (consistent with a random walk with drift
model), and therefore our ARIMA model is largely free of residual autocorrelation.


Implementation of ARIMA Model in Python


Let us now see how to implement an ARIMA model in Python using the pandas and statsmodels
libraries.

1. Load Libraries
Firstly, we load our libraries as standard. The library of major importance in this case is
statsmodels, since we are using this library to calculate the ACF and PACF statistics, and also to
formulate the ARIMA model:

import pandas
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import numpy as np
import math
from statsmodels.tsa.stattools import acf, pacf
import statsmodels.tsa.stattools as ts
from statsmodels.tsa.arima_model import ARIMA

2. Import csv and define “price” variable using pandas

variables = pandas.read_csv('jnj.csv')
price = variables['price']

3. Autocorrelation and Partial Autocorrelation Plots

lnprice=np.log(price)
lnprice
plt.plot(lnprice)
plt.show()
acf_1 = acf(lnprice)[1:20]
plt.plot(acf_1)
plt.show()
test_df = pandas.DataFrame([acf_1]).T
test_df.columns = ['Pandas Autocorrelation']
test_df.index += 1
test_df.plot(kind='bar')
pacf_1 = pacf(lnprice)[1:20]
plt.plot(pacf_1)
plt.show()
test_df = pandas.DataFrame([pacf_1]).T
test_df.columns = ['Pandas Partial Autocorrelation']
test_df.index += 1
test_df.plot(kind='bar')
result = ts.adfuller(lnprice, 1)
result


We see that statsmodels produces the autocorrelation and partial autocorrelation plots:


Moreover, we have confirmation that our data follows an AR(1) process that is stationary once
differenced (i.e. with a constant mean, variance, and autocorrelation), as the first-differenced log
price series now shows a stationary pattern:

lnprice_diff=lnprice-lnprice.shift()
diff=lnprice_diff.dropna()
acf_1_diff = acf(diff)[1:20]
test_df = pandas.DataFrame([acf_1_diff]).T
test_df.columns = ['First Difference Autocorrelation']
test_df.index += 1
test_df.plot(kind='bar')
pacf_1_diff = pacf(diff)[1:20]
plt.plot(pacf_1_diff)
plt.show()

4. ARIMA Model Generation

price_matrix=lnprice.as_matrix()
model = ARIMA(price_matrix, order=(0,1,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())

Using the (0,1,0) configuration, our ARIMA model is generated:


As previously mentioned, our data is in logarithmic format. Since we are analysing stock price,
this format is necessary to account for compounding returns. However, once we have obtained
the forecasts (for seven periods out in this case), then we can obtain the real price forecast by
converting the logarithmic figure to an exponent:

predictions=model_fit.predict(122, 127, typ='levels')
predictions
predictionsadjusted=np.exp(predictions)
predictionsadjusted


Data Cleaning and Structuring Tips in R


One of the big issues when working with data in any context is data cleaning and the merging of
datasets. It is often the case that you will find yourself having to collate data across multiple
files, and will need to rely on R to carry out operations that you would normally perform in Excel
with commands such as VLOOKUP.

Data Cleaning and Merging Functions

For our example, we have two datasets:

1. sales.csv: This file contains the variables Date, ID (which is Product ID), and Sales. We load this
into R under the name mydata.

2. customers.csv: This file contains the variables ID, Age, and Country. We load this into R under
the name mydata2.

The following are examples of popular techniques employed in R to clean a dataset, along with
how to format variables effectively to facilitate analysis. The below functions work particularly
well with panel datasets, where we have a mixture of cross-sectional and time series data:

1. Storing variables in a data frame:

To start off with a simple example, let us choose the customers dataset. Suppose that we only
wish to include the variables ID and Age in our data. To do this, we define our data frame as
follows:

dataframe<-data.frame(ID,Age)
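
Note that referring to ID and Age directly assumes the customers dataset has been attached; a minimal sketch:

# read in the customers file and attach it so columns can be referenced by name
mydata2 <- read.csv("customers.csv")
attach(mydata2)
dataframe <- data.frame(ID, Age)
head(dataframe)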

2. Mimic VLOOKUP by using the merge functions:

Oftentimes, it is necessary to combine variables from different datasets, similar to how VLOOKUP
is used in Excel to join two tables based on certain criteria. In R, this can be done using the
merge function.


For instance, suppose that we wish to link the Sales variable in the sales dataset with the Age and
Country variables in the customers dataset – with the ID variable being the common link.
Therefore, we do as follows:

mergeinfo<-merge(mydata[, c("ID", "Sales")],mydata2[, c("ID", "Age", "Country")])

Upon doing this, we see that a new dataset is formed in R joining our chosen variables:

3. Using as.Date to format dates and calculate duration

Suppose that we now wish to calculate the number of days between the current date and the
date of sale as listed in the sales file. In order to accomplish this, we can use as.Date as follows:

currentdate=as.Date('2016-12-15')
dateinfile=as.Date(Date)
Duration=currentdate-dateinfile

Going back to the example above, suppose that we now wish to combine this duration variable
with the rest of our data.

Hence, we can now combine our new Duration variable with the merge function as above, and
can do this as follows:

durationasdouble=as.double.difftime(Duration, units='days')
updateddataframe=data.frame(ID,Sales,Date,durationasdouble)


updateddataframe

4. grepl: Remove instances of a string from a variable

Let us look to the Country variable. Suppose that we wish to remove all instances of “Greenland”
from our variable. This is accomplished using the grepl command:

countryremoved<-mydata2[!grepl("Greenland", mydata2$Country),]

5. Delete observations using the head and tail functions

The head and tail functions can be used if we wish to delete certain observations from a variable,
e.g. Sales. Used with a negative argument, head(Sales, -30) removes the last 30 rows, while
tail(Sales, -30) removes the first 30 rows.

When it comes to using a variable edited in this way for calculation purposes, e.g. a regression,
the as.matrix function is also used to convert the variable into matrix format:

Salesminus30days<-head(Sales,-30)
X1=as.matrix(Salesminus30days)
X1


Salesplus30days<-tail(Sales,-30)
X2=as.matrix(Salesplus30days)
X2

6. Replicate SUMIF using the "aggregate" function

Let us suppose that we have created the following table (created directly in R, not from a
dataset), and want to obtain the total web visits and total minutes spent on a website for each
person in any particular period. A hypothetical version of such a table is constructed in the
sketch below:
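
For illustration only, the column names below are assumptions rather than values taken from the original file:

# hypothetical table with repeated identifiers
raw_table <- data.frame(names = c("John", "Elizabeth", "John", "Elizabeth", "Michael"),
                        webvisits = c(20, 5, 12, 8, 16),
                        minutes = c(5.5, 2.4, 3.1, 4.0, 6.2))
raw_table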

In this instance, we can replicate the SUMIF function in Excel (where the values associated with a
specific identifier are summed up) by using the aggregate function in R. This can be done as
follows (where raw_table is the table specified as above):

sumif_table<-aggregate(. ~ names, data=raw_table, sum)


sumif_table

Thus, the values associated with each identifier (in this case, names) are summed up in the resulting table.

