Andrew Pua
March 2022
Use data across countries at a specific point in time. Denote each country by index \(i\).
Assume that \(g\) and \(\delta\) are constant across countries. Depreciation rates do not vary greatly across countries. (Justifications are found on page 410.)
Assume that \(\log A\left(0\right)_{i}=a+\varepsilon_{i}\). Thus, \[\log y_{i}^{*} = a+gt^{*}+\dfrac{\alpha}{1-\alpha}\log s_{i}-\dfrac{\alpha}{1-\alpha}\log\left(n_{i}+g+\delta\right)+\varepsilon_{i}.\]
MRW assume that \(s_{i}\) and \(n_{i}\) are independent of \(\varepsilon_{i}\). (Economic justifications are given on page 411.)
Is there an econometric justification for imposing independence?
Cross-sectional dataset
Variables: y85 (real GDP per working-age person in 1985), inv (share of investment in GDP, in percent), pop (growth rate of the working-age population, in percent)
Three samples of countries: 98 non-oil countries, 75 countries with better quality data (intermediate sample), 22 OECD countries with populations greater than 1 million
Assume that \(g+\delta=0.05\).
options(digits=3) # Learn to present to the appropriate precision
require(haven) # Need this package to load Stata datasets
## Loading required package: haven
MRW <- read_dta("./MRW.dta") # Load Stata dataset
# Generate new variables
MRW$ly85 <- log(MRW$y85)
MRW$linv <- log(MRW$inv/100)
MRW$lpop <- log(MRW$pop/100 + 0.05)
MRW.TableI.nonoil <- lm(ly85 ~ linv + lpop, data = subset(MRW, MRW$n==1)) # Apply OLS
coef.TableI.nonoil <- coefficients(MRW.TableI.nonoil) # Extract coefficients
coef.TableI.nonoil
## (Intercept) linv lpop
## 5.43 1.42 -1.99
set.seed(20220312)
n <- 1000
true_ability <- rnorm(n, 50, 10)
noise_1 <- rnorm(n, 0, 10)
noise_2 <- rnorm(n, 0, 10)
midterm <- true_ability + noise_1
final <- true_ability + noise_2
lm(final ~ midterm)
##
## Call:
## lm(formula = final ~ midterm)
##
## Coefficients:
## (Intercept) midterm
## 22.666 0.541
Let \(Y_{new}=c_1+c_2 Y\) and \(X_{1,new}=d_1+d_2 X_1\) (scalar). Let \(c_1\), \(c_2\neq 0\), \(d_1\), \(d_2\neq 0\) be constants.
You can always write \[Y_{new}=\beta_{0,new}^*+\beta_{1,new}^*X_{1,new}+u_{new}.\] But how are the new coefficients related to the original coefficients?
Start with \[\begin{aligned} Y &=\beta_{0}^*+\beta_{1}^*X_1+u \\ \frac{Y_{new}-c_1}{c_2}&= \beta_{0}^*+\beta_{1}^*\left(\frac{X_{1,new}-d_1}{d_2}\right)+u \\ Y_{new} &= \left(c_2\beta_0^*+c_1-c_2\frac{\beta_1^*d_1}{d_2}\right)+c_2\frac{\beta_1^*}{d_2}X_{1,new}+c_2 u.\end{aligned}\]
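A quick simulated check of this relationship (a sketch; the data and the constants \(c_1,c_2,d_1,d_2\) below are made up):
set.seed(20220312)
x1 <- rnorm(100)
y <- 1 + 2*x1 + rnorm(100)
c1 <- 3; c2 <- 2; d1 <- -1; d2 <- 0.5
b <- coef(lm(y ~ x1)) # original OLS coefficients
coef(lm(I(c1 + c2*y) ~ I(d1 + d2*x1))) # OLS after rescaling Y and X1
c(c2*b[1] + c1 - c2*b[2]*d1/d2, c2*b[2]/d2) # implied new coefficients: they match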
Now let us look at interaction terms and what they are useful for.
Let \(Y=\beta_0^*+\beta_1^* X_1+\beta_2^* X_2+\beta_3^* X_1X_2+u\). Assume \(X_1\) is a binary/dummy variable and \(X_2\) is continuous.
Consider the following comparisons where we look at the best linear prediction of \(Y\) given:
The last comparison is sometimes used to make causal statements in a differences-in-differences framework. Note that at the moment, we have a predictive comparison only.
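A small simulated illustration (a sketch; the data-generating numbers are made up). With a dummy \(X_1\) and a continuous \(X_2\), the coefficient on \(X_2\) is the slope of the fitted line for the \(X_1=0\) group, and the interaction coefficient is the difference in slopes between the \(X_1=1\) and \(X_1=0\) groups:
set.seed(20220312)
x1 <- rbinom(200, 1, 0.5) # dummy regressor
x2 <- rnorm(200) # continuous regressor
y <- 1 + 0.5*x1 + 2*x2 + 1.5*x1*x2 + rnorm(200)
coef(lm(y ~ x1*x2)) # x1:x2 is the difference in slopes between the two groups
coef(lm(y ~ x2, subset = (x1 == 0))) # matches the intercept and x2 coefficient above
coef(lm(y ~ x2, subset = (x1 == 1))) # slope matches x2 coefficient plus interaction coefficient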
How about powers like squares or cubes?
Let \(Y=\beta_0^*+\beta_1^* X_1+\beta_2^* X_1^2+u\). Assume \(X_1\) is continuous.
Compare the best linear prediction of \(Y\) when \(X_1=x+\Delta x\) against the best linear prediction of \(Y\) when \(X_1=x\). Then the difference is \[\beta_1^* \Delta x + \beta_2^* \left(2x\Delta x+ \left(\Delta x\right)^2\right).\]
On a per unit difference basis, we can write \[\dfrac{\beta_1^* \Delta x + \beta_2^* \left(2x\Delta x+ \left(\Delta x\right)^2\right)}{\Delta x} \overset{\Delta x\to 0}{\approx} \beta_1^*+2\beta_2^*x.\]
In the model with interactions or with powers, it might be a good idea to do some centering.
Consider once again \(Y=\beta_0^*+\beta_1^* X_1+\beta_2^* X_1^2+u\). Assume \(X_1\) is continuous. Define \(X_{1,new}=X_1-\mathbb{E}\left(X_1\right)\).
You can write \(Y=\beta_0^*+\beta_1^* X_1+\beta_2^* X_1^2+u\) as \[Y=\underbrace{\beta_0^*+\beta_1^* \mathbb{E}\left(X_1\right)+ \beta_2^* \left(\mathbb{E}\left(X_1\right)\right)^2}_{\beta_{0,new}^*} + \underbrace{\left(\beta_1^*+2\beta_2^*\mathbb{E}\left(X_1\right)\right)}_{\beta_{1,new}^*} X_{1,new} +\underbrace{\beta_2^*}_{\beta_{2,new}^*} X_{1,new}^2+u\]
What do you notice about \(\beta_{2,new}^*\)?
Try conducting a similar analysis if you have interaction terms.
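A simulated illustration of the last two points (a sketch; the numbers are made up, and the sample mean stands in for \(\mathbb{E}\left(X_1\right)\)):
set.seed(20220312)
x1 <- rnorm(500, mean = 2)
y <- 1 + x1 - 0.5*x1^2 + rnorm(500)
coef(lm(y ~ x1 + I(x1^2))) # original parameterization
x1.c <- x1 - mean(x1) # center at the sample mean
coef(lm(y ~ x1.c + I(x1.c^2))) # squared-term coefficient unchanged; linear term is now beta1 + 2*beta2*mean(x1)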
Setup: \[\underset{\left(n\times1\right)}{Y}=\left(\begin{array}{c} Y_{1}\\ \vdots\\ Y_{n} \end{array}\right),\underset{\left(n\times\left(k+1\right)\right)}{\boldsymbol{\mathrm{X}}}=\left(\begin{array}{c} X_{1}^{\prime}\\ \vdots\\ X_{n}^{\prime} \end{array}\right)=\left(\begin{array}{cccc} 1 & {X_{11}} & \cdots & {X_{k1}}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & {X_{1n}} & \cdots & {X_{kn}} \end{array}\right),\underset{\left(\left(k+1\right)\times1\right)}{\beta}=\left(\begin{array}{c} \beta_{0}\\ \vdots\\ \beta_{k} \end{array}\right)\]
Least squares objective function: \[\begin{eqnarray}\min_{\beta}\dfrac{1}{n}\left(Y-\boldsymbol{\mathrm{X}}\beta\right)^{\prime}\left(Y-\boldsymbol{\mathrm{X}}\beta\right) &=& \min_{\beta}\dfrac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-X_{t}^{\prime}\beta\right)^{2} \\ &=& \min_{\beta}\dfrac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-\beta_{0}-\beta_{1}X_{1t}-\cdots-\beta_{k}X_{kt}\right)^{2}\end{eqnarray}\]
Solution to the minimization problem (the least squares minimizer or OLS estimator): \[\widehat{\beta}=\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\left(\boldsymbol{\mathrm{X}}^{\prime}Y\right)=\left({\displaystyle \frac{1}{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}X_{t}^{\prime}\right)^{-1}\left({\displaystyle \frac{1}{n}}{\displaystyle \sum_{t=1}^{n}}X_{t}Y_{t}\right)\]
To obtain the LS minimizer, use calculus or matrix calculus, depending on your level of comfort. But avoid this, if possible. Use orthogonality!
The system of equations or the first-order conditions of the minimization problem are given by \[\begin{eqnarray}\dfrac{1}{n}\sum_{t=1}^{n}2\left(Y_{t}-\widehat{\beta}_{0}-\widehat{\beta}_{1}X_{1t}-\cdots-\widehat{\beta}_{k}X_{kt}\right)\left(-1\right) &=& 0 \\ \vdots &=& \vdots\\ \dfrac{1}{n}\sum_{t=1}^{n}2\left(Y_{t}-\widehat{\beta}_{0}-\widehat{\beta}_{1}X_{1t}-\cdots-\widehat{\beta}_{k}X_{kt}\right)\left(-X_{kt}\right) &=& 0\end{eqnarray}\]
See how it is related to orthogonality from before?
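A quick numerical check of the matrix formula against lm(), reusing the MRW regression fitted earlier (a sketch):
X <- model.matrix(MRW.TableI.nonoil) # design matrix including the constant
Y <- model.response(model.frame(MRW.TableI.nonoil))
solve(t(X) %*% X) %*% (t(X) %*% Y) # matches coefficients(MRW.TableI.nonoil)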
The quantity \(\widehat{\beta}^{\prime}\boldsymbol{\mathrm{X}}^{\prime}M_{\iota}\boldsymbol{\mathrm{X}}\widehat{\beta}=\left\Vert M_{\iota}\boldsymbol{\mathrm{X}}\widehat{\beta}\right\Vert ^{2}\) is called the explained sum of squares (SSE).
Therefore, the centered R-squared, denoted by \(R_{c}^{2}\) or simply \(R^{2}\), is defined as \[R_{c}^{2}=\dfrac{\widehat{\beta}^{\prime}\boldsymbol{\mathrm{X}}^{\prime}M_{\iota}\boldsymbol{\mathrm{X}}\widehat{\beta}}{Y^{\prime}M_{\iota}Y}=\dfrac{\left\Vert M_{\iota}\boldsymbol{\boldsymbol{\mathrm{X}}}\widehat{\beta}\right\Vert ^{2}}{\left\Vert M_{\iota}Y\right\Vert ^{2}}=1-\dfrac{\left\Vert e\right\Vert ^{2}}{\left\Vert M_{\iota}Y\right\Vert ^{2}}\] Do you understand now why it is centered?
Note that \(R^{2}\) never decreases (and typically increases) when you put more columns in \(\mathbf{X}\).
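A quick check of these expressions with the MRW regression (a sketch):
e <- residuals(MRW.TableI.nonoil)
Y <- model.response(model.frame(MRW.TableI.nonoil))
Yc <- Y - mean(Y) # M_iota Y: deviations from the sample mean
c(1 - sum(e^2)/sum(Yc^2), summary(MRW.TableI.nonoil)$r.squared) # both give the centered R-squared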
(Stachurski) Here is a series of properties of the R-squared.
Show that the uncentered R-squared is not affected by changing \(Y\) to \(\alpha Y\), where \(\alpha\neq 0\) is a scalar. Start with \[R_{uc,\alpha}^{2}\overset{?}{=}\left\Vert P_{\mathbf{X}}\left(\alpha Y\right)\right\Vert ^{2}/\left\Vert \left(\alpha Y\right)\right\Vert ^{2}\overset{?}{=}\cdots\overset{?}{=}R_{uc}^{2}.\]
Show that the uncentered R-squared is affected by changing \(Y\) to \(Y+\alpha\iota\), where \(\iota\) is a vector of ones and \(\alpha\neq0\) is a scalar. In particular, let \(\mathbf{X}\) contain a column of ones, then we have (justify each step) \[R_{uc,\alpha}^{2}\overset{?}{=}\dfrac{\left\Vert P_{\mathbf{X}}\left(Y+\alpha\iota\right)\right\Vert ^{2}}{\left\Vert Y+\alpha\iota\right\Vert ^{2}}\overset{?}{=}\dfrac{\left\Vert P_{\mathbf{X}}Y+\alpha P_{\mathbf{X}}\iota\right\Vert ^{2}}{\left\Vert Y+\alpha\iota\right\Vert ^{2}}\overset{?}{=}\dfrac{\left\Vert P_{\mathbf{X}}Y+\alpha\iota\right\Vert ^{2}}{\left\Vert Y+\alpha\iota\right\Vert ^{2}}\overset{?}{=}\dfrac{\left\Vert P_{\mathbf{X}}\left(Y/\alpha\right)+\iota\right\Vert ^{2}}{\left\Vert \left(Y/\alpha\right)+\iota\right\Vert ^{2}}.\]
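A quick numerical check of both properties (a sketch with simulated data and a design matrix containing a column of ones):
set.seed(20220312)
x <- rnorm(30)
y <- 1 + x + rnorm(30)
X <- cbind(1, x)
P <- X %*% solve(t(X) %*% X) %*% t(X) # projection matrix onto the column space of X
r2.uc <- function(v) sum((P %*% v)^2)/sum(v^2) # uncentered R-squared
c(r2.uc(y), r2.uc(3*y)) # unchanged when Y is rescaled
c(r2.uc(y), r2.uc(y + 5)) # changes when a constant is added to Y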
Setup: \(Y\), \(\boldsymbol{\mathrm{X}}\), \(\beta\) same as before, now add \(\underset{\left(J\times\left(k+1\right)\right)}{R}\) and \(\underset{\left(J\times1\right)}{r}\) both nonstochastic, with \(\mathsf{rank}\left(R\right)=J\leq k+1\).
Use when? When there are linear equality constraints on \(\beta\). User must specify \(R\) and \(r\) in advance.
Restricted least squares (RLS) objective function and constraints: \[\min_{\beta}\left(Y-\boldsymbol{\mathrm{X}}\beta\right)^{\prime}\left(Y-\boldsymbol{\mathrm{X}}\beta\right)\ \ s.t.\ R\beta=r\]
Lagrangian: \[\min_{\beta,\lambda}\left(Y-\boldsymbol{\mathrm{X}}\beta\right)^{\prime}\left(Y-\boldsymbol{\mathrm{X}}\beta\right)+2\lambda^{\prime}\left(r-R\beta\right)\]
Show that \[\widetilde{\beta}=\widehat{\beta}-{\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}}R^{\prime}\left[R\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}R^{\prime}\right]^{-1}\left(R\widehat{\beta}-r\right).\]
Intuitively, the SSR for LS should be no greater than the SSR for RLS, i.e., you should expect that \(\left\Vert e\right\Vert ^{2}\) would be no larger than \(\left\Vert \widetilde{e}\right\Vert ^{2}\).
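A sketch verifying the RLS formula with the MRW regression and the Solow restriction \(R=\left(0,1,1\right)\), \(r=0\) (the coefficients on linv and lpop sum to zero); the result coincides with the restricted regression on ldiff fitted further below:
X <- model.matrix(MRW.TableI.nonoil)
Y <- model.response(model.frame(MRW.TableI.nonoil))
XtXinv <- solve(t(X) %*% X)
beta.hat <- XtXinv %*% (t(X) %*% Y) # unrestricted OLS
R <- matrix(c(0, 1, 1), nrow = 1)
r <- 0
beta.tilde <- beta.hat - XtXinv %*% t(R) %*% solve(R %*% XtXinv %*% t(R)) %*% (R %*% beta.hat - r)
c(beta.tilde) # linv and lpop coefficients now sum to zero
c(sum((Y - X %*% beta.hat)^2), sum((Y - X %*% beta.tilde)^2)) # unrestricted SSR is no larger than restricted SSR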
There are textbook discussions of the linear regression model where the regressors in \(\boldsymbol{\mathrm{X}}\) are treated directly as constants.
The analysis is very similar to the version where you condition on \(\boldsymbol{\mathrm{X}}\), especially for the finite-sample theory. The notation is likely easier to handle.
However, the asymptotic theory is a bit different.
##
## Call:
## lm(formula = ly85 ~ linv + lpop, data = subset(MRW, MRW$n ==
## 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7914 -0.3937 0.0412 0.4337 1.5805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.430 1.584 3.43 0.00090 ***
## linv 1.424 0.143 9.95 < 2e-16 ***
## lpop -1.990 0.563 -3.53 0.00064 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.689 on 95 degrees of freedom
## Multiple R-squared: 0.601, Adjusted R-squared: 0.592
## F-statistic: 71.5 on 2 and 95 DF, p-value: <2e-16
\[\mathsf{Var}\left(\varepsilon|\mathbf{X}\right)=\begin{pmatrix}\mathsf{Var}\left(\varepsilon_{1}|\mathbf{X}\right) & \mathsf{Cov}\left(\varepsilon_{1},\varepsilon_{2}|\mathbf{X}\right) & \ldots & \mathsf{Cov}\left(\varepsilon_{1},\varepsilon_{n}|\mathbf{X}\right)\\ \mathsf{Cov}\left(\varepsilon_{2},\varepsilon_{1}|\mathbf{X}\right) & \mathsf{Var}\left(\varepsilon_{2}|\mathbf{X}\right) & \ldots & \mathsf{Cov}\left(\varepsilon_{2},\varepsilon_{n}|\mathbf{X}\right)\\ \vdots & \vdots & \ddots & \vdots\\ \mathsf{Cov}\left(\varepsilon_{n},\varepsilon_{1}|\mathbf{X}\right) & \mathsf{Cov}\left(\varepsilon_{n},\varepsilon_{2}|\mathbf{X}\right) & \ldots & \mathsf{Var}\left(\varepsilon_{n}|\mathbf{X}\right) \end{pmatrix}.\]
Assume that the errors are conditionally non-autocorrelated but conditionally heteroscedastic. Show that conditional on \(\mathbf{X}\), \(\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\boldsymbol{\mathrm{X}}^{\prime}\mathsf{diag}\{\varepsilon^2_1,\ldots, \varepsilon^2_n\} \boldsymbol{\mathrm{X}}\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\) is an unbiased estimator of \(\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\boldsymbol{\mathrm{X}}^{\prime}\mathsf{Var}\left(\varepsilon|\boldsymbol{\mathrm{X}}\right)\boldsymbol{\mathrm{X}}\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\).
Let \(e\) be the residual vector from the least squares fit. Impose Assumptions 3.1 to 3.4. Find \(\mathbb{E}\left(e|\mathbf{X}\right)\) and \(\mathsf{Var}\left(e|\mathbf{X}\right)\).
It might seem strange to find \(\mathsf{Var}\left(e|\mathbf{X}\right)\) under Assumption 3.4. Will the expected value of \(\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\boldsymbol{\mathrm{X}}^{\prime}\mathsf{diag}\{e^2_1,\ldots, e^2_n\} \boldsymbol{\mathrm{X}}\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\) be equal to \(\sigma^2\left(\boldsymbol{\mathrm{X}}^{\prime}\boldsymbol{\mathrm{X}}\right)^{-1}\), conditional on \(\mathbf{X}\)?
This is based on Exercises 3.8 and 3.9 of the main textbook.
If there is imperfect multicollinearity, then the variances of the estimators of the individual coefficients are going to be large, leading to imprecision.
This is not a defect, but a correct representation of the reality of empirical practice. Standard errors should reflect the consequences of imperfect multicollinearity.
Note that there is no “fixing” of this imperfect multicollinearity. In fact, there are cases where having imperfect multicollinearity could be a good thing.
It is possible that linear combinations of individual coefficients could be estimated more precisely under imperfect multicollinearity.
Once again, by linearity:\[\left(\begin{array}{c} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{n} \end{array}\right)=\beta_0^o\left(\begin{array}{c} 1\\ 1\\ \vdots\\ 1 \end{array}\right)+\beta_1^o\left(\begin{array}{c} X_{11}\\ X_{12}\\ \vdots\\ X_{1n} \end{array}\right)+\left(\begin{array}{c} \varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{n} \end{array}\right)\]
Reparameterize to \[\left(\begin{array}{c} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{n} \end{array}\right)=\left(\beta_0^o+\beta_1^o \overline{X}_1\right)\left(\begin{array}{c} 1\\ 1\\ \vdots\\ 1 \end{array}\right)+\beta_1^o\left(\begin{array}{c} X_{11}-\overline{X}_1\\ X_{12}-\overline{X}_1\\ \vdots\\ X_{1n}-\overline{X}_1 \end{array}\right)+\left(\begin{array}{c} \varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{n} \end{array}\right)\]
Consider the following orthogonal transformation \[A=\left(\begin{array}{cccc} \dfrac{1}{\sqrt{n}} & \dfrac{1}{\sqrt{n}} & \cdots & \dfrac{1}{\sqrt{n}}\\ \dfrac{X_{11}-\overline{X}_{1}}{\sqrt{\sum\left(X_{1t}-\overline{X}_{1}\right)^{2}}} & \dfrac{X_{12}-\overline{X}_{1}}{\sqrt{\sum\left(X_{1t}-\overline{X}_{1}\right)^{2}}} & \cdots & \dfrac{X_{1n}-\overline{X}_{1}}{\sqrt{\sum\left(X_{1t}-\overline{X}_{1}\right)^{2}}}\\ a_{31} & a_{32} & \cdots & a_{3n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{array}\right).\]
This transformation exists and is justified by the Gram-Schmidt orthogonalization procedure (full QR version).
Because \(A\) is an orthogonal matrix, \[\begin{eqnarray}\sum\varepsilon_{t}^{2} &=& \sum\nu_{t}^{2}\\ \sum\left(Y_{t}-\beta_{0}^{o}-\beta_{1}^{o}X_{1t}\right)^{2} &=& \left[\sqrt{n}\overline{Y}-\sqrt{n}\left(\beta_{0}^{o}+\beta_{1}^{o}\overline{X}_{1}\right)\right]^{2}\\ && +\left[\dfrac{\sum\left(X_{1t}-\overline{X}_{1}\right)Y_t}{\sqrt{\sum\left(X_{1t}-\overline{X}_{1}\right)^{2}}}-\beta_{1}^{o}\sqrt{\sum\left(X_{1t}-\overline{X}_{1}\right)^{2}}\right]^{2}\\ && +\nu_{3}^{2}+\cdots+\nu_{n}^{2}\end{eqnarray}\]
Observe that \[\sum\left(Y_{t}-\widehat{\beta}_{0}-\widehat{\beta}_{1}X_{1t}\right)^{2} = \nu_{3}^{2}+\cdots+\nu_{n}^{2}.\] Therefore, the minimized SSR depends on only \(n-2\) remaining pieces of the errors after orthogonal transformation.
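A numerical illustration of this degrees-of-freedom argument (a sketch with simulated data): qr.Q() with complete = TRUE returns an orthogonal matrix whose first two columns span the column space of \(\left(\iota,X_1\right)\), so its transpose plays the role of \(A\).
set.seed(20220312)
n.sim <- 20
x1 <- rnorm(n.sim)
y <- 1 + 2*x1 + rnorm(n.sim)
Q <- qr.Q(qr(cbind(1, x1)), complete = TRUE) # n x n orthogonal matrix
nu <- t(Q) %*% y # coordinates of Y in the new basis
c(sum(residuals(lm(y ~ x1))^2), sum(nu[3:n.sim]^2)) # minimized SSR equals nu_3^2 + ... + nu_n^2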
The covariance matrix (along with the standard errors) was calculated under conditional homoscedasticity because MRW imposed it.
Do you think Assumption 3.4 is applicable?
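The covariance matrix and standard errors displayed below were presumably produced by commands along these lines (the object cov.TableI.nonoil is reused in the testing code further below):
cov.TableI.nonoil <- vcov(MRW.TableI.nonoil) # estimated covariance matrix under conditional homoscedasticity
cov.TableI.nonoil
sqrt(diag(cov.TableI.nonoil)) # standard errors, matching the summary() output above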
## (Intercept) linv lpop
## (Intercept) 2.5087 0.0983 0.8799
## linv 0.0983 0.0205 0.0229
## lpop 0.8799 0.0229 0.3174
## (Intercept) linv lpop
## 1.584 0.143 0.563
To construct an exact finite-sample \(100\left(1-\alpha\right)\%\) confidence set for \(R\beta^{o}\) under normality, you have to provide \(R\) and then find the \(\left(1-\alpha\right)\) quantile of the \(F_{J,n-\left(k+1\right)}\) distribution, call it \(c_{1-\alpha}\). The confidence set will have the form \[\left(R\widehat{\beta}-R\beta^{o}\right)^{\prime}\left(Rs^{2}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}R^{\prime}\right)^{-1}\left(R\widehat{\beta}-R\beta^{o}\right)/J\leq c_{1-\alpha}.\]
Geometrically, this confidence set would look like a shaded ellipsoid. This shape has implications for constructing joint confidence sets compared to combining separate confidence intervals.
To test the hypothesis \(R\beta^{o}=r\) at the \(100\alpha\%\) significance level, you have to provide \(R\) and \(r\) and compute what some call an \(F\)-statistic:\[F=\left(R\widehat{\beta}-r\right)^{\prime}\left(Rs^{2}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}R^{\prime}\right)^{-1}\left(R\widehat{\beta}-r\right)/J\] and then calculate either a critical value from \(F_{J,n-\left(k+1\right)}\) or a one-sided \(p\)-value \(\Pr\left(F_{J,n-\left(k+1\right)}\geq F^{act}|H_0\right)\), where \(F^{act}\) is the actual value of the statistic computed from the data.
Just like in confidence sets, the test is really a joint test. The results from a joint test are not necessarily the same as testing each of the \(J\) hypotheses one at a time.
The test statistic reduces to a \(t\)-statistic for the special case where \(J=1\).
R.mat <- c(0, 1, 1) # R matrix for testing Solow hypothesis
# Calculate F statistic for testing Solow hypothesis
test.stat <- t(R.mat %*% coef.TableI.nonoil) %*% solve(R.mat %*% cov.TableI.nonoil %*% R.mat) %*%
R.mat %*% coef.TableI.nonoil
# Test statistic, p-value, critical value
c(test.stat, 1-pf(test.stat, 1, 95), qf(0.95, 1, 95))
## [1] 0.834 0.363 3.941
# Generate new variable for restricted regression
MRW$ldiff <- MRW$linv - MRW$lpop
# Apply OLS to restricted regression
MRW.TableI.restricted.nonoil <- lm(ly85 ~ ldiff, data = subset(MRW, MRW$n==1))
# Compute test using SSR comparisons
anova(MRW.TableI.nonoil, MRW.TableI.restricted.nonoil)
## Analysis of Variance Table
##
## Model 1: ly85 ~ linv + lpop
## Model 2: ly85 ~ ldiff
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 95 45.1
## 2 96 45.5 -1 -0.396 0.83 0.36
# Apply OLS to reparameterized model
MRW.TableI.repar.nonoil <- lm(ly85 ~ ldiff + lpop, data = subset(MRW, MRW$n==1))
summary(MRW.TableI.repar.nonoil)[[4]] # Focusing on the summary of the coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.430 1.584 3.428 9.00e-04
## ldiff 1.424 0.143 9.951 2.10e-16
## lpop -0.566 0.619 -0.913 3.63e-01
est.beta1 <- coef(MRW.TableI.restricted.nonoil)[[2]] # slope on ldiff in the restricted regression
implied.alpha <- est.beta1/(1+est.beta1) # implied capital share: alpha = beta1/(1 + beta1)
implied.alpha
## [1] 0.598
# Delta method: gradient of beta1/(1 + beta1) with respect to (beta0, beta1) is (0, 1/(1 + beta1)^2)
est.var <- c(0, 1/(1+est.beta1)^2) %*% vcov(MRW.TableI.restricted.nonoil) %*% c(0, 1/(1+est.beta1)^2)
delta.method.se <- sqrt(est.var) # delta-method standard error for implied.alpha
delta.method.se
## [,1]
## [1,] 0.0201
Derive the corresponding \(t\)-distribution results for \(J=1\).
Show that \[F=\dfrac{\left(R\widehat{\beta}-r\right)^{\prime}\left(R\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}R^{\prime}\right)^{-1}\left(R\widehat{\beta}-r\right)/J}{s^{2}}=\dfrac{\left(\widetilde{e}^{\prime}\widetilde{e}-e^{\prime}e\right)/J}{e^{\prime}e/\left(n-\left(k+1\right)\right)}.\]
Show that you can also express the previous version of the \(F\)-statistic in terms of R-squared, but again this is of limited use. Do you know why? Why would the SSR version be preferable to the R-squared version of the \(F\)-statistic?
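A quick numerical check of the SSR form in the first exercise, using the MRW fits from before (deviance() returns the residual sum of squares of an lm fit):
SSR.u <- deviance(MRW.TableI.nonoil) # unrestricted SSR
SSR.r <- deviance(MRW.TableI.restricted.nonoil) # restricted SSR
((SSR.r - SSR.u)/1)/(SSR.u/95) # matches the F statistic computed earlier (about 0.83)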
Consider the following data generating process shown in class: \(\left\{ \left(Y_{t},X_{1t}\right)\right\} _{t=1}^{n}\) are IID draws from \[\begin{eqnarray*} Y_{t} &=& -1+2X_{1t}+\varepsilon_{t} \\ X_{1t} &\sim & Bin\left(1,0.3\right) \\ \varepsilon_{t}|X_{1t}=0 &\sim & N\left(0,1\right) \\ \varepsilon_{t}|X_{1t}=1 &\sim & N\left(0,1\right)\end{eqnarray*}\]
Which of Assumptions 3.1, 3.2, 3.4, 3.5 are satisfied?
You should try modifying the code to the case where \(N\left(0,1\right)\) is changed to \(N\left(0,4\right)\). Answer the previous question again and modify the code accordingly.
Try increasing or decreasing \(n\).
What about changing \(Bin\left(1,0.3\right)\) to \(Bin\left(1,0.8\right)\)?
Introduce storage for \(s^2\) and examine its center and distribution. (A sketch appears after the Monte Carlo output below.)
set.seed(20220312)
n <- 50
# "True" beta values
beta0.o <- -1
beta1.o <- 2
reps <- 10^4
# Storage for OLS estimates (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
# Storage for robust covariance matrix (4 entries per replication, 2x2 matrix)
rob.store <- matrix(NA, nrow=reps, ncol=4)
# Storage for non-robust covariance matrix (4 entries per replication, 2x2 matrix)
nonrob.store <- matrix(NA, nrow=reps, ncol=4)
# Monte Carlo loop
for (i in 1:reps)
{
X.t <- rbinom(n, 1, 0.3) # Generate X
eps.t <- (rnorm(n, 0, 1))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0) # Generate epsilon
Y.t <- beta0.o + beta1.o*X.t + eps.t # Generate Y
matXX <- t(cbind(1, X.t)) %*% cbind(1, X.t) # X'X matrix
beta.hat <- solve(matXX) %*% (t(cbind(1, X.t)) %*% Y.t) # OLS
# robust cov matrix
bread <- matXX
resid <- Y.t - cbind(1, X.t) %*% beta.hat
meat <- (t(cbind(1, X.t)) %*% diag(c(resid^2))) %*% cbind(1, X.t)
est.rob <- (solve(bread) %*% meat) %*% solve(bread)
s.sq <- 1/(n-2) * sum(resid^2) # estimator for sigma2 under cond. homoscedasticity
est.nonrob <- s.sq * solve(matXX) # nonrobust cov matrix
beta.store[i,] <- c(beta.hat)
rob.store[i,] <- c(est.rob)
nonrob.store[i,] <- c(est.nonrob)
}
## [1] -1 2
## [1] 0.172 0.313
## [1] 0.166 0.301
## [1] 0.169 0.313
# Test statistic for null beta0 = -1 using nonrobust cov matrix
t.ratios.beta0 <- (beta.store[,1]-beta0.o)/sqrt(nonrob.store[,1])
# Test statistic for null beta0 = -1 using robust cov matrix
t.ratios.beta0.rob <- (beta.store[,1]-beta0.o)/sqrt(rob.store[,1])
# Test statistic for null beta1 = 2 using nonrobust cov matrix
t.ratios.beta1 <- (beta.store[,2]-beta1.o)/sqrt(nonrob.store[,4])
# Test statistic for null beta1 = 2 using robust cov matrix
t.ratios.beta1.rob <- (beta.store[,2]-beta1.o)/sqrt(rob.store[,4])
# Empirical rejection rate alpha = 0.05
mean(abs(t.ratios.beta0)>qnorm(0.975))
## [1] 0.0592
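The remaining empirical rejection rates below were presumably computed analogously (the exact commands are not shown here; this is an assumption):
mean(abs(t.ratios.beta0.rob)>qnorm(0.975)) # beta0, robust covariance matrix
mean(abs(t.ratios.beta1)>qnorm(0.975)) # beta1, nonrobust covariance matrix
mean(abs(t.ratios.beta1.rob)>qnorm(0.975)) # beta1, robust covariance matrix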
## [1] 0.0542
## [1] 0.0653
## [1] 0.0677
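For the exercise on storing \(s^2\), a minimal self-contained sketch (reusing n, reps, beta0.o, and beta1.o from above; one possible way, not code from the original):
s.sq.store <- rep(NA, reps) # storage for s^2, one entry per replication
for (i in 1:reps)
{
X.t <- rbinom(n, 1, 0.3) # Generate X
eps.t <- (rnorm(n, 0, 1))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0) # Generate epsilon
Y.t <- beta0.o + beta1.o*X.t + eps.t # Generate Y
resid <- residuals(lm(Y.t ~ X.t)) # residuals from the least squares fit
s.sq.store[i] <- sum(resid^2)/(n - 2) # s^2 under conditional homoscedasticity
}
c(mean(s.sq.store), var(s.sq.store)) # center should be close to the true error variance of 1
hist(s.sq.store) # with normal errors, a scaled chi-squared shape with n - 2 degrees of freedom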