Andrew Pua
March 2022
One option is to invoke the law of iterated expectations and condition on ALL observations of \(X_{1}\). Therefore, \[\mathbb{E}\left(\left.\widehat{\beta}_{1}\right|X_{11},X_{12},\ldots,X_{1n}\right)=\sum_{t=1}^{n}\mathbb{E}\left(\left.W_{t}Y_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)={\displaystyle \sum_{t=1}^{n}}W_{t}\mathbb{E}\left(\left.Y_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right).\]
Under the “modern” version of the linear regression model, we must have \(Y_{t}=\beta_{0}^{*}+\beta_{1}^{*}X_{1t}+u_{t}\). Thus, \[\mathbb{E}\left(\left.\widehat{\beta}_{1}\right|X_{11},X_{12},\ldots,X_{1n}\right)=\beta_{1}^{*}+{\displaystyle \sum_{t=1}^{n}}W_{t}\mathbb{E}\left(\left.u_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right).\]
We finally encounter the most crucial hurdle to establishing unbiasedness. The only property we know about \(u_{t}\) is that it has zero covariance with \(X_{1t}\). This is insufficient to establish that \(\mathbb{E}\left(\left.u_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)=0\), even under IID conditions.
Therefore, stronger assumptions are required to establish finite-sample results such as unbiasedness. This motivates us to revisit conditional expectations.
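Here is a small simulation sketch of this hurdle. The data-generating process is my own illustrative assumption (not from the notes): \(u_t\) is constructed to have zero mean and zero covariance with \(X_{1t}\), yet \(\mathbb{E}\left(\left.u_t\right|X_{1t}\right)\neq 0\), and the OLS slope shows a visible small-sample bias relative to \(\beta_1^*\).

```r
# Illustrative simulation (assumed DGP): Cov(u, X1) = 0 but E(u | X1) != 0.
# X1 ~ Exp(1), u = X1^2 - 4*X1 + 2, so E(u) = 0 and Cov(u, X1) = 0,
# but E(u | X1) is a nonzero function of X1.
set.seed(42)
beta1_star <- 2                      # slope of the best linear predictor
n    <- 10                           # small n makes the bias visible
reps <- 50000
slope_hat <- replicate(reps, {
  x1 <- rexp(n)
  u  <- x1^2 - 4 * x1 + 2
  y  <- 1 + beta1_star * x1 + u
  cov(x1, y) / var(x1)               # OLS slope estimate
})
mean(slope_hat) - beta1_star         # clearly different from 0 at n = 10
```

The bias shrinks as \(n\) grows: zero covariance is enough for consistency, but not for unbiasedness.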
Return to our linear prediction setting.
Consider a prediction rule of the form \(g\left(X_1\right)\) instead of \(\beta_{0}+\beta_{1}X_{1}\).
In order to show that \(\widehat{\beta}_1\) is unbiased for \(\beta_1^*\), we need to somehow impose \[\mathbb{E}\left(\left.u_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)=0,\] or, under IID conditions, \[\mathbb{E}\left(\left.u_{t}\right|X_{1t}\right)=0.\]
But this means that \[\mathbb{E}\left(\left.Y_{t}\right|X_{11},X_{12},\ldots,X_{1n}\right)=\beta_0^*+\beta_1^* X_{1t},\] or, under IID conditions, \[\mathbb{E}\left(\left.Y_{t}\right|X_{1t}\right)=\beta_0^*+\beta_1^* X_{1t}.\]
To show unbiasedness, we have to assume that the CEF is actually equal to the best linear predictor under IID conditions. Without IID conditions, we need something even stronger, called strict exogeneity.
Therefore, \(\beta_0^*\) and \(\beta_1^*\) have more than the usual meaning as the coefficients of the best linear predictor.
In addition, the “modern” version of the linear regression model turns into something more classical.
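As a companion to the earlier simulation sketch, here is what happens once \(\mathbb{E}\left(\left.u_t\right|X_{1t}\right)=0\) is imposed (again with an assumed DGP of mine, here enforced by drawing \(u_t\) independently of \(X_{1t}\)): the same small-sample OLS slope is unbiased.

```r
# Companion sketch (assumed DGP): with E(u | X1) = 0 -- here u is drawn
# independently of X1 -- the OLS slope is unbiased even at n = 10.
set.seed(7)
beta1_star <- 2
n    <- 10
reps <- 50000
slope_hat <- replicate(reps, {
  x1 <- rexp(n)
  u  <- rnorm(n)                     # independent of X1, so E(u | X1) = 0
  y  <- 1 + beta1_star * x1 + u
  cov(x1, y) / var(x1)
})
mean(slope_hat) - beta1_star         # approximately 0
```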
Bivariate normality is a case that allows us to justify that the CEF coincides with the best linear predictor.
You can show that when \[\left(\begin{array}{c} X_{1}\\ Y \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{1}\\ \mu_{Y} \end{array}\right),\left(\begin{array}{cc} \sigma_{1}^{2} & \rho\sigma_{1}\sigma_{Y}\\ \rho\sigma_{1}\sigma_{Y} & \sigma_{Y}^{2} \end{array}\right)\right),\] we must have \[\mathbb{E}\left(Y|X_1\right)=\mu_Y+\rho\frac{\sigma_Y}{\sigma_1}\left(X_1-\mu_1\right)=\underbrace{\mu_Y-\rho\frac{\sigma_Y}{\sigma_1}\mu_1}_{\beta_0^o}+\underbrace{\rho\frac{\sigma_Y}{\sigma_1}}_{\beta_1^o}X_1.\]
In addition, the conditional variance is actually a constant under bivariate normality: \[\mathsf{Var}\left(Y|X_1\right)=\left(1-\rho^2\right)\sigma_Y^2.\]
Finally, the conditional distribution is also normally distributed, so that: \[Y|X_1=x_1 \sim N\left(\beta_0^o+\beta_1^ox_1, \left(1-\rho^2\right)\sigma_Y^2\right).\]
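A quick numerical check of these formulas (the parameter values below are arbitrary choices of mine): simulate from the bivariate normal by drawing \(X_1\) and then \(Y\) from its conditional distribution, and compare the regression coefficients with \(\beta_0^o\) and \(\beta_1^o\).

```r
# Sketch with arbitrary parameter values: under bivariate normality the CEF
# is exactly linear, so a regression of Y on X1 recovers (beta_0^o, beta_1^o).
set.seed(123)
mu1 <- 1; muY <- 2; sig1 <- 1.5; sigY <- 2; rho <- 0.6
n  <- 1e5
x1 <- rnorm(n, mean = mu1, sd = sig1)
y  <- rnorm(n,
            mean = muY + rho * (sigY / sig1) * (x1 - mu1),   # E(Y | X1)
            sd   = sqrt(1 - rho^2) * sigY)                   # sqrt of Var(Y | X1)
coef(lm(y ~ x1))                          # close to (beta_0^o, beta_1^o)
c(muY - rho * (sigY / sig1) * mu1,        # beta_0^o
  rho * (sigY / sig1))                    # beta_1^o
```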
Predict scalar \(Y\) when you have information about a vector \(X=\left(1,X_{1},\ldots,X_{k}\right)^{\prime}\). What stays the same? What changes?
For the CEF: nothing essential changes. The CEF is now \(\mathbb{E}\left(\left.Y\right|X_{1},\ldots,X_{k}\right)\), and it remains the best predictor of \(Y\) in terms of mean squared error.
For the best linear predictor: the system of equations before \[\left(\begin{array}{cc} 1 & \mathbb{E}\left(X_1\right)\\ \mathbb{E}\left(X_1\right) & \mathbb{E}\left(X^{2}_1\right) \end{array}\right)\left(\begin{array}{c} \beta_{0}^{*}\\ \beta_{1}^{*} \end{array}\right) = \left(\begin{array}{c} \mathbb{E}\left(Y\right)\\ \mathbb{E}\left(X_1Y\right) \end{array}\right)\] extends to \[\underbrace{\left(\begin{array}{cccc} 1 & \mathbb{E}\left(X_{1}\right) & \cdots & \mathbb{E}\left(X_{k}\right)\\ \mathbb{E}\left(X_{1}\right) & \mathbb{E}\left(X_{1}^{2}\right) & \cdots & \mathbb{E}\left(X_{1}X_{k}\right)\\ \vdots & \vdots & \ddots & \vdots\\ \mathbb{E}\left(X_{k}\right) & \mathbb{E}\left(X_{k}X_{1}\right) & \cdots & \mathbb{E}\left(X_{k}^{2}\right) \end{array}\right)}_{Q=\mathbb{E}\left(XX^{\prime}\right)}\underbrace{\left(\begin{array}{c} \beta_{0}^{*}\\ \beta_{1}^{*}\\ \vdots\\ \beta_{k}^{*} \end{array}\right)}_{\beta^{*}} = \underbrace{\left(\begin{array}{c} \mathbb{E}\left(Y\right)\\ \mathbb{E}\left(X_{1}Y\right)\\ \vdots\\ \mathbb{E}\left(X_{k}Y\right) \end{array}\right)}_{\mathbb{E}\left(XY\right)}.\]
Thus, \(\beta^{*}=\left[\mathbb{E}\left(XX^{\prime}\right)\right]^{-1}\mathbb{E}\left(XY\right)\), provided \(\mathbb{E}\left(XX^{\prime}\right)\) is invertible.
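A short sketch of the sample analogue (the data below are simulated placeholders of mine): replace the population moments \(\mathbb{E}\left(XX^{\prime}\right)\) and \(\mathbb{E}\left(XY\right)\) with sample averages and solve the system; the result matches the least-squares coefficients.

```r
# Sketch (simulated placeholder data): the sample analogue of
# beta* = [E(XX')]^{-1} E(XY) coincides with the OLS coefficients.
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- runif(n)
y  <- 0.5 + 1.0 * x1 - 2.0 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)              # each row is (1, X_1, ..., X_k)'
Qhat <- crossprod(X) / n            # sample analogue of E(XX')
qhat <- crossprod(X, y) / n         # sample analogue of E(XY)
solve(Qhat, qhat)                   # sample version of beta*
coef(lm(y ~ x1 + x2))               # same numbers
```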
One way to interpret “linear” is as the statement that the CEF is a linear function of \(X\).
But the relevant meaning of “linear” here is clear from \(X^{\prime}\beta\): linear in \(\beta\). That means \(X\) can contain nonlinear transformations of the underlying variables.
For example: transformations (taking logarithms), interaction terms (products of variables in \(X\)), and many more, as long as the specification remains linear in the parameters, as in the sketch below.
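All of the following specifications are “linear” in the relevant sense (the variable names and data are placeholders of mine):

```r
# Sketch with placeholder data: each model is nonlinear in the variables
# but linear in the parameters, so it fits within the framework above.
set.seed(2)
x1 <- rexp(100); x2 <- rnorm(100)
y  <- 1 + log(x1) + 0.5 * x2 + rnorm(100)
lm(y ~ log(x1) + x2)                # log transformation of a regressor
lm(y ~ x1 * x2)                     # levels of x1, x2 plus their interaction
lm(y ~ x1 + I(x1^2))                # quadratic in x1
```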
We will talk about interpreting coefficients in the presence of transformations, interactions, and other such specifications next time. In fact, you have already encountered issues with interpretation in a previous exercise.
We now give an interpretation of the last component of \(\beta^{*}\); we focus on the last component for concreteness, but the result is much more general than it appears. The last coefficient has the following interpretation: \[\beta_{k}^{*}=\dfrac{\mathsf{Cov}\left(Y,v_{k}\right)}{\mathsf{Var}\left(v_{k}\right)},\] where \(v_{k}\) is the prediction error defined below.
Suppose you have found the optimal coefficients of the best linear predictor of \(X_{k}\) (a scalar random variable) using \(X_{-k}^{\prime}=\left(1,X_{1},X_{2},\ldots,X_{k-1}\right)\). We can always write \[X_{k}=X_{-k}^{\prime}\delta^{*}+v_{k},\] where \(v_{k}\) is the error from this best linear prediction. We must also have \[Y=X^{\prime}\beta^{*}+u,\] where \(u\) is another error from best linear prediction.
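A numerical sketch of this partialling-out interpretation, in its sample version and with simulated data of mine: compute \(v_k\) as the residual from regressing \(X_k\) on the remaining regressors, and compare \(\mathsf{Cov}\left(Y,v_k\right)/\mathsf{Var}\left(v_k\right)\) with the coefficient on \(X_k\) from the full regression; the two agree exactly.

```r
# Sketch (simulated data): beta_k^* equals Cov(Y, v_k) / Var(v_k), where v_k
# is the residual from the (sample) linear prediction of X_k using the others.
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)           # plays the role of X_k
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
v  <- resid(lm(x2 ~ x1))            # v_k: residual of X_k on (1, X_1)
cov(y, v) / var(v)                  # Cov(Y, v_k) / Var(v_k)
coef(lm(y ~ x1 + x2))["x2"]         # identical, by the partialling-out result
```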