Andrew Pua
April 2022
Losing IID conditions: We need to modify our current set of asymptotic tools.
Spurious regressions: Do regressions help in making conditional predictions? Do regressions uncover meaningful relationships?
Unit roots: What happens to regressions when variables are trending?
And many more we do not cover: Take a time series course at some point.
We studied the asymptotic theory in the simple case of one regressor. The theory, with suitable modifications in notation, directly extends to the more general case of having more regressors.
But IID is restrictive for economic data: we have time series, spatial, and panel data.
The main textbook focuses squarely on the time series case.
Suppose you are interested in estimating the parameters of a first-order autoregression or AR(1) process \(Y_{t}=\beta_0^*+\beta_1^* Y_{t-1}+u_{t}\), where \(u_t\) is error from best linear prediction.
To give you a sense of what the data on \(\left\{ Y_{t}\right\}_{t=1}^{n}\) would look like, here are some pictures where \(\beta_0^*=0\) and \(\beta_1^*\) can be 0, 0.5, 0.95, and 1. I assume that \(u_{t}\sim N\left(0,1\right)\) and \(Y_{0}\sim N\left(0,1\right)\).
You will see two plots side-by-side. One is a time-series plot where \(Y_{t}\) is plotted against \(t\) and the other is a scatterplot where \(Y_{t}\) is plotted against \(Y_{t-1}\).
To enhance comparability, I use the same set of randomly drawn \(u_{t}\)’s and \(Y_{0}\)’s across the four cases.
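A minimal sketch of how pictures like these could be generated (the sample size, seed, and plotting details here are illustrative choices, not the code behind the original figures):

y0 <- rnorm(1); u <- rnorm(100)   # one set of draws, reused for every beta1
for (b1 in c(0, 0.5, 0.95, 1)) {
  n <- length(u)
  y <- numeric(n)
  y[1] <- b1 * y0 + u[1]
  for (t in 2:n) y[t] <- b1 * y[t - 1] + u[t]   # AR(1) with beta0 = 0
  par(mfrow = c(1, 2))
  plot(y, type = "l", main = bquote(beta[1] == .(b1)), xlab = "t", ylab = expression(Y[t]))
  plot(y[-n], y[-1], xlab = expression(Y[t - 1]), ylab = expression(Y[t]))
}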
Now, let us evaluate the performance of OLS when we generate multiple “instances” of the first-order autoregression given earlier.
So the Monte Carlo design here is as follows:
A 5% significance level was used for testing the null that \(\beta_1^*\) is equal to the value in the indicated column.
set.seed(20220318)
require(dyn)
reps <- 10^3
mod <- 1    # multiplier for the sample size (n = 40*mod)
coefs <- matrix(NA, nrow=reps, ncol=4)
SEs <- matrix(NA, nrow=reps, ncol=4)
t.stat <- matrix(NA, nrow=reps, ncol=4)
for (i in 1:reps)
{
  # Four DGPs sharing the same innovations: white noise (beta1 = 0),
  # AR(1) with beta1 = 0.5 and 0.95, and a random walk (beta1 = 1)
  y1 <- arima.sim(n = 40*mod, list(order=c(0,0,0)))
  y2 <- arima.sim(n = 40*mod, list(order=c(1,0,0), ar = 0.5), innov = y1)
  y3 <- arima.sim(n = 40*mod, list(order=c(1,0,0), ar = 0.95), innov = y1)
  y4 <- ts(cumsum(y1))
  # Regress each series on its own first lag
  model.y1 <- dyn$lm(y1~lag(y1,-1))
  model.y2 <- dyn$lm(y2~lag(y2,-1))
  model.y3 <- dyn$lm(y3~lag(y3,-1))
  model.y4 <- dyn$lm(y4~lag(y4,-1))
  temp.c <- c(coef(model.y1)[2],coef(model.y2)[2],coef(model.y3)[2],coef(model.y4)[2])
  temp.d <- sqrt(c(vcov(model.y1)[2,2],vcov(model.y2)[2,2],vcov(model.y3)[2,2],vcov(model.y4)[2,2]))
  coefs[i,] <- temp.c
  SEs[i,] <- temp.d
  # t-statistics for testing the null that beta1 equals its true value
  t.stat[i,] <- (temp.c-c(0,0.5,0.95,1))/temp.d
}
mean.ols <- colMeans(coefs)      # Monte Carlo mean of the OLS slope estimates
mean.reg.se <- colMeans(SEs)     # Monte Carlo mean of the reported standard errors
sd.ols <- apply(coefs, 2, sd)    # Monte Carlo standard deviation of the OLS slope estimates
# Rejection rates at the 5% level based on standard normal critical values
p.vals <- (2*pnorm(-abs(t.stat)))<0.05
p.vals <- apply(p.vals, 2, mean)
results <- rbind(mean.ols, mean.reg.se, sd.ols, p.vals)
colnames(results) <- c("beta1=0", "beta1=0.5", "beta1=0.95", "beta1=1")
results
## beta1=0 beta1=0.5 beta1=0.95 beta1=1
## mean.ols -0.0268 0.433 0.8314 0.8740
## mean.reg.se 0.1626 0.146 0.0868 0.0746
## sd.ols 0.1595 0.150 0.1115 0.0992
## p.vals 0.0650 0.070 0.1880 0.2920
The next two sets of results repeat the same exercise with larger sample sizes (larger values of mod). Notice how the bias and the size distortions shrink as the sample size grows, except in the unit-root column, where the rejection rate stays near 30%.
## beta1=0 beta1=0.5 beta1=0.95 beta1=1
## mean.ols -0.00368 0.4882 0.9247 0.9677
## mean.reg.se 0.07957 0.0694 0.0297 0.0188
## sd.ols 0.07570 0.0663 0.0334 0.0274
## p.vals 0.04300 0.0350 0.0870 0.3010
## beta1=0 beta1=0.5 beta1=0.95 beta1=1
## mean.ols -0.00228 0.4964 0.9445 0.99160
## mean.reg.se 0.03959 0.0344 0.0129 0.00483
## sd.ols 0.03922 0.0342 0.0133 0.00679
## p.vals 0.04900 0.0460 0.0490 0.29300
Dickey and Fuller (1979) have shown that when testing the null of a unit root, the asymptotic distribution of the test statistic under the null is nonstandard.
But their research further indicates that the asymptotic distribution of the test statistic under the null changes depending on the presence or absence of deterministic variables in the autoregression (e.g. time trends, intercepts), and the nature of the null being tested.
For more on the nonstandard behavior in the unit root case, see Chang and Park (2002).
We will rule out the unit root case in our discussions, but we point out one more issue related to it.
Let us talk about another issue with running regressions when the variables are trending. Consider the following two situations:
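The exact simulation code behind the output below is not shown; here is a hedged sketch of data that could produce two situations like these (the seed and sample size are assumptions, chosen only so the regressions take the same form as the output): first, regress one white-noise series on another, unrelated one; second, regress one random walk on another, unrelated one.

require(dyn)
set.seed(1)               # illustrative seed, not the one behind the output below
n <- 1000
x1 <- ts(rnorm(n))        # situation 1: two unrelated white-noise series
y1 <- ts(rnorm(n))
x2 <- ts(cumsum(x1))      # situation 2: two unrelated random walks
y2 <- ts(cumsum(y1))      # (partial sums of the noise series above)
summary(dyn$lm(y1 ~ x1))
summary(dyn$lm(y2 ~ x2))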
##
## Call:
## lm(formula = dyn(y1 ~ x1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.588 -0.650 0.004 0.699 3.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0958 0.0313 3.06 0.0023 **
## x1 0.0282 0.0320 0.88 0.3797
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.988 on 998 degrees of freedom
## Multiple R-squared: 0.000773, Adjusted R-squared: -0.000228
## F-statistic: 0.772 on 1 and 998 DF, p-value: 0.38
##
## Call:
## lm(formula = dyn(y2 ~ x2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.04 -11.37 0.24 10.73 26.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.4113 0.6112 49.8 <2e-16 ***
## x2 0.8107 0.0147 55.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 998 degrees of freedom
## Multiple R-squared: 0.752, Adjusted R-squared: 0.752
## F-statistic: 3.03e+03 on 1 and 998 DF, p-value: <2e-16
What you have observed in the second case is a phenomenon called spurious regression or “nonsense regression”. A version of this phenomenon was noted as early as Yule (1926) and was brought to renewed attention by Granger and Newbold (1974).
Granger and Newbold (1974) also show that measures of fit from spurious regressions will typically indicate very good fit even if the two variables are truly unrelated. This is yet another instance where standard measures of fit like the R-squared have to be interpreted with caution.
Nonsense regressions can also happen in the context of IID data. Try simulating a case where there are many unrelated \(X\)’s included relative to sample size.
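For instance, a quick sketch of this exercise (the sample size and number of regressors are arbitrary choices):

set.seed(1)                          # illustrative seed
n <- 30; k <- 25                     # many unrelated regressors relative to the sample size
X <- matrix(rnorm(n * k), nrow = n)
y <- rnorm(n)                        # outcome unrelated to every column of X
summary(lm(y ~ X))$r.squared         # typically quite high despite no true relationship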
There are two broad ways of solving the spurious regression problem:
##
## Call:
## lm(formula = dyn(diff(y2) ~ diff(x2)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.590 -0.650 0.002 0.698 3.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0973 0.0313 3.11 0.002 **
## diff(x2) 0.0291 0.0320 0.91 0.364
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.987 on 997 degrees of freedom
## Multiple R-squared: 0.000826, Adjusted R-squared: -0.000176
## F-statistic: 0.825 on 1 and 997 DF, p-value: 0.364
Under covariance stationarity, \[\mathsf{Var}\left(\overline{Z}\right)=\frac{1}{n}\left[\mathsf{Var}\left(Z_{t}\right)+2\sum_{j=1}^{n-1}\left(1-\frac{j}{n}\right)\mathsf{Cov}\left(Z_{t},Z_{t-j}\right)\right].\] So one way for \(\mathsf{Var}\left(\overline{Z}\right)\to0\) as \(n\to\infty\) is to have \(\mathsf{Var}\left(Z_{t}\right)\) bounded and \[\begin{aligned}\left\vert \sum_{j=1}^{n-1}\left(1-\frac{j}{n}\right)\mathsf{Cov}\left(Z_{t},Z_{t-j}\right)\right\vert & \leq\sum_{j=1}^{n-1}\left\vert 1-\frac{j}{n}\right\vert \left\vert \mathsf{Cov}\left(Z_{t},Z_{t-j}\right)\right\vert \\ & \leq \sum_{j=1}^{n-1}\left\vert \mathsf{Cov}\left(Z_{t},Z_{t-j}\right)\right\vert \end{aligned}\] bounded as \(n\to\infty\).
Thus, under certain conditions on the autocovariances, \(\overline{Z}\overset{qm}{\to}\mathbb{E}\left(Z_{t}\right)\). Hence, \(\overline{Z}\overset{p}{\to}\mathbb{E}\left(Z_{t}\right)\).
What you saw is the simplest version of the ergodic theorem under stationarity.
It is possible to have a slightly more complicated version of this ergodic theorem under nonstationarity; see the very accessible note by Shalizi (2022).
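As a quick illustration of the ergodic theorem in action (the AR(1) coefficient and sample sizes below are arbitrary choices), the sample mean of a stationary AR(1) settles down near the population mean even though the observations are serially dependent:

set.seed(1)                                   # illustrative seed
z <- arima.sim(n = 10^5, list(ar = 0.9))      # stationary AR(1) with population mean 0
n.grid <- c(10^2, 10^3, 10^4, 10^5)
sapply(n.grid, function(m) mean(z[1:m]))      # sample means approach 0 as the sample grows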
One way to look at the processes from the previous slide is to understand how these processes capture predictability.
Recall that if \(\left\{ Z_{t}\right\}\) is an IID sequence, then \(Z_{t}|Z_{t-1},Z_{t-2},\ldots,Z_{1}\sim Z_{t}\).
Compare this with the weaker notion of unpredictability embodied by an MDS: only the conditional mean, rather than the entire conditional distribution, is restricted.
A CLT can also be developed along similar lines. Recall that consistency of the sample mean for the population mean was obtained by showing that \(\mathsf{Var}\left(\overline{Z}\right)\to0\) as \(n\to\infty\).
This means that to derive a distributional result, we have to rescale to ensure that the variance does not disappear as \(n\to\infty\), just like before. In particular, \[\mathsf{Var}\left[\sqrt{n} \left( \overline{Z}-\mathbb{E}\left(Z_t\right)\right)\right]=\mathsf{Var}\left(Z_{t}\right)+2\sum_{j=1}^{n-1}\left(1-\frac{j}{n}\right)\mathsf{Cov}\left(Z_{t},Z_{t-j}\right).\] Once again, we need boundedness conditions on the right hand side as \(n\to\infty\).
So we aim to obtain a CLT that looks like: \[\sqrt{n} \left( \overline{Z}-\mathbb{E}\left(Z_t\right)\right) \overset{d}{\to} N\left(0, V\right),\] where \(V\) is sometimes referred to as the long-run variance.
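As a concrete example (a standard calculation, not from the slides): if \(Z_{t}=\rho Z_{t-1}+u_{t}\) with \(\left|\rho\right|<1\) and \(u_{t}\) white noise with variance \(\sigma^{2}\), then \(\mathsf{Cov}\left(Z_{t},Z_{t-j}\right)=\rho^{\left|j\right|}\sigma^{2}/\left(1-\rho^{2}\right)\) and \[V=\sum_{j=-\infty}^{\infty}\mathsf{Cov}\left(Z_{t},Z_{t-j}\right)=\frac{\sigma^{2}}{1-\rho^{2}}\left(1+2\sum_{j=1}^{\infty}\rho^{j}\right)=\frac{\sigma^{2}}{1-\rho^{2}}\cdot\frac{1+\rho}{1-\rho}=\frac{\sigma^{2}}{\left(1-\rho\right)^{2}}.\] Note that \(V\) blows up as \(\rho\to1\), which is one way to see why the unit-root case needs separate treatment.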
Again, there are three approaches:
What is the form of \(V\)?
Consider the following standard argument: \[\begin{eqnarray} \sqrt{n}\left(\widehat{\beta}-\beta^{*}\right) = \left(\dfrac{1}{n}\sum_{t=1}^{n}X_{t}X_{t}^{\prime}\right)^{-1}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\\ \overset{d}{\rightarrow} N\bigg(0,\underbrace{Q^{-1}\left[\mathsf{Avar}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\right]Q^{-1}}_{\mathsf{Avar}\left(\sqrt{n}\left(\widehat{\beta}-\beta^{*}\right)\right)}\bigg) \end{eqnarray}\]
In the proof, we would need:
In the MDS case: \[\dfrac{1}{n}{\displaystyle \sum_{t=1}^{n}}X_{t}X_{t}^{\prime}\widehat{u}_{t}^{2}\overset{p}{\rightarrow}\lim_{n\rightarrow\infty}\mathsf{Var}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right)\]
In the non-MDS case:\[\dfrac{1}{n}{\displaystyle \sum_{j=-p_{n}}^{p_{n}}}k\left(\frac{j}{p_{n}}\right)\widehat{\Gamma}\left(j\right)\overset{p}{\rightarrow}\lim_{n\rightarrow\infty}\mathsf{Var}\left(\dfrac{1}{\sqrt{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}u_{t}\right),\] where \(\widehat{\Gamma}\left(j\right)\) consistently estimates \(\Gamma\left(j\right)\), \(k\left(\cdot\right)\) is a user-specified kernel function, and \(p_{n}\) is a user-specified bandwidth.
A naive estimator of the “meat” could have been \[\sum_{j=-(n-1)}^{n-1}\widehat{\Gamma}\left(j\right)=\widehat{\Gamma}\left(0\right)+\sum_{j=1}^{n-1}\left[\widehat{\Gamma}\left(j\right)+\widehat{\Gamma}\left(j\right)^{\prime}\right].\]
In effect, you make the limits of the summation in the long-run variance expression finite.
But you have to ask yourself how many observations are used to estimate \(\widehat{\Gamma}\left(j\right)\), say for \(j=0\) and for \(j=n-1\). You will realize that this estimator may be too naive and will be subject to a lot of estimation error.
You may have prior information that \(\Gamma\left(j\right)=0\) for all \(j>p\), where \(p\) is known, small-ish, and finite.
You may need to “trim” the estimator according to some rule and do some reweighting.
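Kernel-based HAC estimators of this type are available in standard software. A minimal sketch in R using the sandwich and lmtest packages (the data generating process, bandwidth, and kernel choices here are purely illustrative):

library(sandwich)
library(lmtest)
set.seed(1)                                   # illustrative seed
x <- arima.sim(n = 200, list(ar = 0.5))       # serially correlated regressor
u <- arima.sim(n = 200, list(ar = 0.5))       # serially correlated error
y <- 1 + 2 * x + u
fit <- lm(y ~ x)
# Bartlett-kernel (Newey-West) HAC standard errors with a user-chosen bandwidth
coeftest(fit, vcov. = NeweyWest(fit, lag = 4, prewhite = FALSE))
# The same estimator with the package's default, data-driven bandwidth rule
coeftest(fit, vcov. = NeweyWest(fit))

Comparing the two sets of standard errors gives a feel for how sensitive HAC inference can be to the bandwidth choice.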
But the past 20 to 30 years of research have shown that we might have to be cautious about the previous approach.
The underlying idea is to account in the asymptotic theory for the reality that bandwidths and reweighting schemes are hard to specify in advance. Another idea is to change the reweighting schemes a bit: use a series estimator instead of a kernel estimator.
Default rules in software are available under extra assumptions about user preferences, which users may sometimes not even be aware of!
HAR inference is still an ongoing field of research and many are trying to make it useful for practitioners. A very illustrative example is by Lazarus, Lewis, Stock, and Watson (2018) and all the published discussions of the article.
But some old ideas are starting to get resurrected. Model the heteroscedasticity and autocorrelation more directly. The idea is to combine generalized least squares and HC/HAC standard errors.
Recall that random variables are mappings from the sample space \(\Omega\) to \(\mathbb{R}\) (or more generally \(\mathbb{R}^{n}\)).
Stochastic processes are mappings of the form \(Z:T\times\Omega\to\mathbb{R}\) (or more generally to \(\mathbb{R}^{n}\)), where \(T\) is some index set (compare Definition 5.1 of the main textbook).
Stochastic processes embody a “parallel universes” extension of random sampling to the time series case.
Let \(t\in \mathbb{Z}\).
An SP \(\left\{ Z_{t}\right\}_{t=1}^{\infty}\) is strictly stationary if, for any finite integer \(m\), any set of subscripts \(t_{1},t_{2},\ldots,t_{m}\), and any integer \(k\), the joint distribution of \(\left(Z_{t_{1}},Z_{t_{2}},\ldots,Z_{t_{m}}\right)\) is the same as the joint distribution of \(\left(Z_{t_{1}+k},Z_{t_{2}+k},\ldots,Z_{t_{m}+k}\right)\).
An SP \(\left\{ Z_{t}\right\}_{t=1}^{\infty}\) is weakly stationary or covariance stationary or second-order stationary if
Compare and contrast these two stationarity concepts.
Determine which of the processes in the examples are strictly stationary, weakly stationary, or neither.
Martingales embody the idea of “no anticipated changes” given all past information. The efficient markets hypothesis and the consumption smoothing hypothesis are statements about economic quantities that behave like martingales.
By definition, \(\left\{ Z_{t}\right\}\) is a martingale if \(\mathbb{E}\left(Z_{t}|Z_{t-1},Z_{t-2},\ldots\right)=Z_{t-1}\).
So, where does the idea of “no anticipated changes” show up in the definition of a martingale?
Since the best prediction of \(Z_{t}\) given all available past information is its most recent value \(Z_{t-1}\), the CEF \(\mathbb{E}\left(Z_t|I_{t-1}\right)\) is equal to the best linear predictor \(\beta_0^*+\beta_1^* Z_{t-1}+\beta_2^* Z_{t-2} + \cdots\), where \(\beta_0^*=0\), \(\beta_1^*=1\), and \(\beta_j^*=0\) for all \(j\geq 2\).
Another way to see “no anticipated changes” is through the definition of a martingale difference sequence (MDS).
Note that \(Z_t=\beta_0^*+\beta_1^* Z_{t-1}+\beta_2^* Z_{t-2} + \cdots+\varepsilon_t\) where \(\mathbb{E}\left(\varepsilon_t|I_{t-1}\right)=0\).
The CEF error \(\varepsilon_t\) is actually an MDS.
Clearly, an MDS does not have serial correlation.
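To see why, take any \(j\geq1\) and apply the law of iterated expectations: \[\mathsf{Cov}\left(\varepsilon_{t},\varepsilon_{t-j}\right)=\mathbb{E}\left(\varepsilon_{t}\varepsilon_{t-j}\right)=\mathbb{E}\left[\mathbb{E}\left(\varepsilon_{t}\varepsilon_{t-j}|I_{t-1}\right)\right]=\mathbb{E}\left[\varepsilon_{t-j}\,\mathbb{E}\left(\varepsilon_{t}|I_{t-1}\right)\right]=0,\] since \(\varepsilon_{t-j}\) is part of the information set \(I_{t-1}\) (and \(\mathbb{E}\left(\varepsilon_{t}\right)=0\) also follows from the MDS property).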
Suppose you have \(Y_t=X_t^\prime \beta^o+\varepsilon_t\). Assume that \(\{\left(Y_t,X_t^\prime\right)\}\) are realizations from an ergodic stationary process. Assume that the relevant moments exist.
We can further improve the CLT for ergodic stationary MDS to allow for some serial correlation. But as you have seen in our discussion of the asymptotic behavior of the sample mean, we need to control the behavior of certain autocovariances.
Here is the setup:
To ensure that the asymptotic variance of \(\sqrt{n}\,\overline{Z}\) is finite, we need to make sure that \[\mathsf{Var}\left(\sqrt{n}\,\overline{Z}\right)=\mathsf{Var}\left(Z_{t}\right)+2\sum_{j=1}^{n-1}\left(1-\frac{j}{n}\right)\mathsf{Cov}\left(Z_{t},Z_{t-j}\right)\] remains bounded as \(n\to\infty\).
The last condition \[{\displaystyle \sum_{j=0}^{\infty}\left[\mathbb{E}\left(r_{t,j}^{\prime}r_{t,j}\right)\right]^{1/2}<\infty}\] guarantees this.
Suppose the following conditions hold:
Then, as \(n\to\infty\), \[\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n}Z_{t}\right)\overset{d}{\rightarrow}N\left(0,V\right).\]
My suggestion is to focus on the following exercises: